
Web Usage Mining: Discovery and Applications of Usage
Patterns from Web Data

Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan

Department of Computer Science and Engineering
University of Minnesota
200 Union St SE
Minneapolis, MN 55455
{srivasta, cooley, deshpand, ptan}@cs.umn.edu
ABSTRACT
Web usage mining is the application of data mining techniques to discover usage patterns from Web data, in order to understand and better serve the needs of Web-based applications. Web usage mining consists of three phases, namely preprocessing, pattern discovery, and pattern analysis. This paper describes each of these phases in detail. Given its application potential, Web usage mining has seen a rapid increase in interest, from both the research and practice communities. This paper provides a detailed taxonomy of the work in this area, including research efforts as well as commercial offerings. An up-to-date survey of the existing work is also provided. Finally, a brief overview of the WebSIFT system as an example of a prototypical Web usage mining system is given.
Keywords: data mining, world wide web, web usage mining.
1. INTRODUCTION
The ease and speed with which business transactions can be carried out over the Web has been a key driving force in the rapid growth of electronic commerce. Specifically, e-commerce activity that involves the end user is undergoing a significant revolution. The ability to track users' browsing behavior down to individual mouse clicks has brought the vendor and end customer closer than ever before. It is now possible for a vendor to personalize his product message for individual customers at a massive scale, a phenomenon that is being referred to as mass customization.
The scenario described above is one of many possible applications of Web Usage mining, which is the process of applying data mining techniques to the discovery of usage patterns from Web data, targeted towards various applications. Data mining efforts associated with the Web, called Web mining, can be broadly divided into three classes, i.e. content mining, usage mining, and structure mining. Web Structure mining projects such as [34; 54] and Web Content mining projects such as [47; 21] are beyond the scope of this survey. An early taxonomy of Web mining is provided in [29], which also describes the architecture of the WebMiner system [42], one of the first systems for Web Usage mining. The proceedings of the recent WebKDD workshop [41], held in conjunction with the KDD-1999 conference, provide a sampling of some of the current research being performed in the area of Web Usage Analysis, including Web Usage mining.
This paper provides an up-to-date survey of Web Usage mining, including both academic and industrial research efforts, as well as commercial offerings. Section 2 describes the various kinds of Web data that can be useful for Web Usage mining. Section 3 discusses the challenges involved in discovering usage patterns from Web data. The three phases are preprocessing, pattern discovery, and pattern analysis. Section 4 provides a detailed taxonomy and survey of the existing efforts in Web Usage mining, and Section 5 gives an overview of the WebSIFT system [31], as a prototypical example of a Web Usage mining system. Finally, Section 6 discusses privacy concerns and Section 7 concludes the paper.
2. WEB DATA
One of the key steps in Knowledge Discovery in Databases [33] is to create a suitable target data set for the data mining tasks. In Web Mining, data can be collected at the server-side, client-side, proxy servers, or obtained from an organization's database (which contains business data or consolidated Web data). Each type of data collection differs not only in terms of the location of the data source, but also the kinds of data available, the segment of population from which the data was collected, and its method of implementation.
There are many kinds of data that can be used in Web Mining. This paper classifies such data into the following types:

Content: The real data in the Web pages, i.e. the data the Web page was designed to convey to the users. This usually consists of, but is not limited to, text and graphics.

Structure: Data which describes the organization of the content. Intra-page structure information includes the arrangement of various HTML or XML tags within a given page. This can be represented as a tree structure, where the <html> tag becomes the root of the tree. The principal kind of inter-page structure information is hyper-links connecting one page to another.

Usage: Data that describes the pattern of usage of Web pages, such as IP addresses, page references, and the date and time of accesses.

User Profile: Data that provides demographic information about users of the Web site. This includes registration data and customer profile information.

Supported by ARL contract DA/DAKF11-98-P-0359
Supported by NSF grant EHR-9554517
SIGKDD Explorations. Copyright 1999 ACM SIGKDD, Jan 2000. Volume 1, Issue 2.
2.1 Data Sources
The usage data collected at the different sources will represent the navigation patterns of different segments of the overall Web traffic, ranging from single-user, single-site browsing behavior to multi-user, multi-site access patterns.
2.1.1 Server Level Collection

A Web server log is an important source for performing Web Usage Mining because it explicitly records the browsing behavior of site visitors. The data recorded in server logs reflects the (possibly concurrent) access of a Web site by multiple users. These log files can be stored in various formats such as Common log or Extended log formats. An example of Extended log format is given in Figure 2 (Section 3). However, the site usage data recorded by server logs may not be entirely reliable due to the presence of various levels of caching within the Web environment. Cached page views are not recorded in a server log. In addition, any important information passed through the POST method will not be available in a server log. Packet sniffing technology is an alternative method to collecting usage data through server logs. Packet sniffers monitor network traffic coming to a Web server and extract usage data directly from TCP/IP packets. The Web server can also store other kinds of usage information such as cookies and query data in separate logs. Cookies are tokens generated by the Web server for individual client browsers in order to automatically track the site visitors. Tracking of individual users is not an easy task due to the stateless connection model of the HTTP protocol. Cookies rely on implicit user cooperation and thus have raised growing concerns regarding user privacy, which will be discussed in Section 6. Query data is also typically generated by online visitors while searching for pages relevant to their information needs. Besides usage data, the server side also provides content data, structure information and Web page meta-information (such as the size of a file and its last modified time).
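As an illustration of how the fields of a server log entry might be extracted, the following minimal sketch parses one line into named fields. The field layout (IP address, user id, time, request, status, size, referrer, agent) is an assumption patterned on the sample log of Figure 2; real server configurations differ, and the names `LOG_PATTERN` and `parse_log_line` are hypothetical.

```python
import re
from datetime import datetime

# Assumed layout, patterned on Figure 2:
# ip userid [time] "method url protocol" status size referrer agent
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<userid>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) (?P<referrer>\S+) (?P<agent>.*)'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


sample = ('123.456.78.9 - [25/Apr/1998:03:05:34 -0500] '
          '"GET B.html HTTP/1.0" 200 2050 A.html Mozilla/3.04 (Win95, I)')
entry = parse_log_line(sample)
print(entry["url"], entry["referrer"], entry["agent"])
```

Lines that do not follow the assumed layout yield `None`, so a caller can count or inspect rejected entries rather than silently dropping them.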
The Web server also relies on other utilities such as CGI scripts to handle data sent back from client browsers. Web servers implementing the CGI standard parse the URI (see footnote 1) of the requested file to determine if it is an application program. The URI for CGI programs may contain additional parameter values to be passed to the CGI application. Once the CGI program has completed its execution, the Web server sends the output of the CGI application back to the browser.
2.1.2 Client Level Collection
Client-side data collection can be implemented by using a remote agent (such as Javascripts or Java applets) or by modifying the source code of an existing browser (such as Mosaic or Mozilla) to enhance its data collection capabilities. The implementation of client-side data collection methods requires user cooperation, either in enabling the functionality of the Javascripts and Java applets, or in voluntarily using the modified browser. Client-side collection has an advantage over server-side collection because it ameliorates both the caching and session identification problems. However, Java applets perform no better than server logs in terms of determining the actual view time of a page. In fact, they may incur some additional overhead, especially when the Java applet is loaded for the first time. Javascripts, on the other hand, consume little interpretation time but cannot capture all user clicks (such as reload or back buttons). These methods will collect only single-user, single-site browsing behavior. A modified browser is much more versatile and will allow data collection about a single user over multiple Web sites. The most difficult part of using this method is convincing the users to use the browser for their daily browsing activities. This can be done by offering incentives to users who are willing to use the browser, similar to the incentive programs offered by companies such as NetZero [9] and AllAdvantage [2] that reward users for clicking on banner advertisements while surfing the Web.

Footnote 1: Uniform Resource Identifier (URI) is a more general definition that includes the commonly referred to Uniform Resource Locator (URL).
2.1.3 Proxy Level Collection

A Web proxy acts as an intermediate level of caching between client browsers and Web servers. Proxy caching can be used to reduce the loading time of a Web page experienced by users as well as the network traffic load at the server and client sides [27]. The performance of proxy caches depends on their ability to predict future page requests correctly. Proxy traces may reveal the actual HTTP requests from multiple clients to multiple Web servers. This may serve as a data source for characterizing the browsing behavior of a group of anonymous users sharing a common proxy server.
2.2 Data Abstractions
The information provided by the data sources described above can all be used to construct/identify several data abstractions, notably users, server sessions, episodes, click-streams, and page views. In order to provide some consistency in the way these terms are defined, the W3C Web Characterization Activity (WCA) [14] has published a draft of Web term definitions relevant to analyzing Web usage. A user is defined as a single individual that is accessing files from one or more Web servers through a browser. While this definition seems trivial, in practice it is very difficult to uniquely and repeatedly identify users. A user may access the Web through different machines, or use more than one agent on a single machine. A page view consists of every file that contributes to the display on a user's browser at one time. Page views are usually associated with a single user action (such as a mouse-click) and can consist of several files such as frames, graphics, and scripts. When discussing and analyzing user behaviors, it is really the aggregate page view that is of importance. The user does not explicitly ask for "n" frames and "m" graphics to be loaded into his or her browser; the user requests a "Web page." All of the information to determine which files constitute a page view is accessible from the Web server. A click-stream is a sequential series of page view requests. Again, the data available from the server side does not always provide enough information to reconstruct the full click-stream for a site. Any page view accessed through a client or proxy-level cache will not be "visible" from the server side. A user session is the click-stream of page views for a single user across the entire Web. Typically, only the portion of each user session that is accessing a specific site can be used for analysis, since access information is not publicly available from the vast majority of Web servers. The set of page-views in a user session for a particular Web site is referred to as a server session (also commonly referred to as a visit). A set of server sessions is the necessary input for any Web Usage analysis or data mining tool. The end of a server session is defined as the point when the user's browsing session at that site has ended. Again, this is a simple concept that is very difficult to track reliably. Any semantically meaningful subset of a user or server session is referred to as an episode by the W3C WCA.
3. WEB USAGE MINING
As shown in Figure 1, there are three main tasks for performing Web Usage Mining or Web Usage Analysis. This section presents an overview of the tasks for each step and discusses the challenges involved.
3.1 Preprocessing
Preprocessing consists of converting the usage, content, and structure information contained in the various available data sources into the data abstractions necessary for pattern discovery.
3.1.1 Usage Preprocessing

Usage preprocessing is arguably the most difficult task in the Web Usage Mining process due to the incompleteness of the available data. Unless a client side tracking mechanism is used, only the IP address, agent, and server side click-stream are available to identify users and server sessions. Some of the typically encountered problems are:

Single IP address/Multiple Server Sessions - Internet service providers (ISPs) typically have a pool of proxy servers that users access the Web through. A single proxy server may have several users accessing a Web site, potentially over the same time period.

Multiple IP address/Single Server Session - Some ISPs or privacy tools randomly assign each request from a user to one of several IP addresses. In this case, a single server session can have multiple IP addresses.

Multiple IP address/Single User - A user that accesses the Web from different machines will have a different IP address from session to session. This makes tracking repeat visits from the same user difficult.

Multiple Agent/Single User - Again, a user that uses more than one browser, even on the same machine, will appear as multiple users.
Assuming each user has now been identified (through cookies, logins, or IP/agent/path analysis), the click-stream for each user must be divided into sessions. Since page requests from other servers are not typically available, it is difficult to know when a user has left a Web site. A thirty minute timeout is often used as the default method of breaking a user's click-stream into sessions. The thirty minute timeout is based on the results of [23]. When a session ID is embedded in each URI, the definition of a session is set by the content server.
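The timeout heuristic described above can be sketched as follows. This is a minimal illustration, assuming the requests of one already-identified user are available as time-ordered (timestamp, page) pairs; the function name `split_by_timeout` and the sample data are hypothetical.

```python
from datetime import datetime, timedelta


def split_by_timeout(requests, timeout=timedelta(minutes=30)):
    """Split one user's time-ordered (timestamp, page) requests into sessions."""
    sessions = []
    for timestamp, page in requests:
        # Start a new session on the first request or after a long gap.
        if not sessions or timestamp - sessions[-1][-1][0] > timeout:
            sessions.append([])
        sessions[-1].append((timestamp, page))
    return sessions


clicks = [
    (datetime(1998, 4, 25, 3, 4, 41), "A.html"),
    (datetime(1998, 4, 25, 3, 5, 34), "B.html"),
    (datetime(1998, 4, 25, 5, 6, 3), "D.html"),  # more than 30 minutes later
]
timeout_sessions = split_by_timeout(clicks)
print([[page for _, page in s] for s in timeout_sessions])
```

The timeout is a parameter rather than a constant, since thirty minutes is only a commonly used default.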
While the exact content served as a result of each user action is often available from the request field in the server logs, it is sometimes necessary to have access to the content server information as well. Since content servers can maintain state variables for each active session, the information necessary to determine exactly what content is served by a user request is not always available in the URI. The final problem encountered when preprocessing usage data is that of inferring cached page references. As discussed in Section 2.2, the only verifiable method of tracking cached page views is to monitor usage from the client side. The referrer field for each request can be used to detect some of the instances when cached pages have been viewed.
Figure 2 shows a sample log that illustrates several of the problems discussed above (the first column would not be present in an actual server log, and is for illustrative purposes only). IP address 123.456.78.9 is responsible for three server sessions, and IP addresses 209.456.78.2 and 209.456.78.3 are responsible for a fourth session. Using a combination of referrer and agent information, lines 1 through 11 can be divided into three sessions of A-B-F-O-G, L-R, and A-B-C-J. Path completion would add two page references to the first session, giving A-B-F-O-F-B-G, and one reference to the third session, giving A-B-A-C-J. Without using cookies, an embedded session ID, or a client-side data collection method, there is no method for determining that lines 12 and 13 are actually a single server session.
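The session splitting and path completion just described can be sketched as below. This is one possible heuristic, not necessarily the procedure used by any particular system. The trace is a simplified stand-in for lines 1 through 11 of Figure 2: page names omit the ".html" suffix, and the "-" referrers of the entry pages as well as the fifth request are assumptions. The function names are hypothetical.

```python
def split_by_referrer(requests):
    """Group (ip, agent, page, referrer) tuples into candidate sessions.

    Heuristic: a request joins the most recent session of the same
    (ip, agent) pair that already contains its referrer page; otherwise
    (for example when the referrer is "-") it opens a new session.
    """
    sessions = {}  # (ip, agent) -> list of sessions of (page, referrer)
    for ip, agent, page, referrer in requests:
        user_sessions = sessions.setdefault((ip, agent), [])
        target = None
        for session in reversed(user_sessions):
            if any(p == referrer for p, _ in session):
                target = session
                break
        if target is None:
            target = []
            user_sessions.append(target)
        target.append((page, referrer))
    return [s for user_sessions in sessions.values() for s in user_sessions]


def complete_path(session):
    """Insert the backward moves implied by the referrer of each request."""
    path = []   # completed page sequence
    stack = []  # pages on the current forward path
    for page, referrer in session:
        # Backtrack until the referrer is the most recent page on the stack.
        while stack and referrer in stack and stack[-1] != referrer:
            stack.pop()
            path.append(stack[-1])
        stack.append(page)
        path.append(page)
    return path


IP = "123.456.78.9"
W = "Mozilla/3.04 (Win95, I)"
X = "Mozilla/3.01 (X11, I, IRIX6.2, IP22)"
trace = [
    (IP, W, "A", "-"), (IP, W, "B", "A"), (IP, W, "L", "-"), (IP, W, "F", "B"),
    (IP, X, "A", "-"), (IP, X, "B", "A"), (IP, W, "R", "L"), (IP, X, "C", "A"),
    (IP, W, "O", "F"), (IP, X, "J", "C"), (IP, W, "G", "B"),
]
referrer_sessions = split_by_referrer(trace)
for session in referrer_sessions:
    print("-".join(p for p, _ in session), "=>", "-".join(complete_path(session)))
```

Under these assumptions the sketch yields the three sessions named in the text, and path completion inserts F-B into the first and A into the third.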
3.1.2 Content Preprocessing

Content preprocessing consists of converting the text, image, scripts, and other files such as multimedia into forms that are useful for the Web Usage Mining process. Often, this consists of performing content mining such as classification or clustering. While applying data mining to the content of Web sites is an interesting area of research in its own right, in the context of Web Usage Mining the content of a site can be used to filter the input to, or output from, the pattern discovery algorithms. For example, results of a classification algorithm could be used to limit the discovered patterns to those containing page views about a certain subject or class of products. In addition to classifying or clustering page views based on topics, page views can also be classified according to their intended use [50; 30]. Page views can be intended to convey information (through text, graphics, or other multimedia), gather information from the user, allow navigation (through a list of hypertext links), or some combination of these uses. The intended use of a page view can also filter the sessions before or after pattern discovery.
[Figure 1: High Level Web Usage Mining Process. Site Files and Raw Logs are the inputs to Preprocessing, which produces Preprocessed Clickstream Data; Pattern Discovery produces Rules, Patterns, and Statistics; Pattern Analysis produces "Interesting" Rules, Patterns, and Statistics.]

Figure 2: Sample Web Server Log
Fields: #, IP Address, Userid, Time, Method/URL/Protocol, Status, Size, Referrer, Agent
 1  123.456.78.9  [25/Apr/1998:03:04:41 -0500] "GET A.html HTTP/1.0" 200 3290 ...
 2  123.456.78.9  [25/Apr/1998:03:05:34 -0500] "GET B.html HTTP/1.0" 200 2050 A.html Mozilla/3.04 (Win95, I)
 3  123.456.78.9  [25/Apr/1998:03:05:39 -0500] "GET L.html HTTP/1.0" 200 4130 ...
 4  123.456.78.9  [25/Apr/1998:03:06:02 -0500] "GET F.html HTTP/1.0" 200 5096 B.html Mozilla/3.04 (Win95, I)
 5  123.456.78.9  ...
 6  123.456.78.9  [25/Apr/1998:03:07:42 -0500] "GET B.html HTTP/1.0" 200 2050 A.html Mozilla/3.01 (X11, I, IRIX6.2, IP22)
 7  123.456.78.9  [25/Apr/1998:03:07:55 -0500] "GET R.html HTTP/1.0" 200 8140 L.html Mozilla/3.04 (Win95, I)
 8  123.456.78.9  [25/Apr/1998:03:09:50 -0500] "GET C.html HTTP/1.0" 200 1820 A.html Mozilla/3.01 (X11, I, IRIX6.2, IP22)
 9  123.456.78.9  [25/Apr/1998:03:10:02 -0500] "GET O.html HTTP/1.0" 200 2270 F.html Mozilla/3.04 (Win95, I)
10  123.456.78.9  [25/Apr/1998:03:10:45 -0500] "GET J.html HTTP/1.0" 200 9430 C.html Mozilla/3.01 (X11, I, IRIX6.2, IP22)
11  123.456.78.9  [25/Apr/1998:03:12:23 -0500] "GET G.html HTTP/1.0" 200 7220 B.html Mozilla/3.04 (Win95, I)
12  209.456.78.2  ...
13  209.456.78.3  [25/Apr/1998:05:06:03 -0500] "GET D.html HTTP/1.0" 200 1680 A.html Mozilla/3.04 (Win95, I)

In order to run content mining algorithms on page views, the information must first be converted into a quantifiable format. Some version of the vector space model [51] is typically used to accomplish this. Text files can be broken up into vectors of words. Keywords or text descriptions can be substituted for graphics or multimedia. The content of static page views can be easily preprocessed by parsing the HTML and reformatting the information or running additional algorithms as desired. Dynamic page views present more of a challenge. Content servers that employ personalization techniques and/or draw upon databases to construct the page views may be capable of forming more page views than can be practically preprocessed. A given set of server sessions may only access a fraction of the page views possible for a large dynamic site. Also, the content may be revised on a regular basis. The content of each page view to be preprocessed must be "assembled", either by an HTTP request from a crawler, or a combination of template, script, and database accesses. If only the portion of page views that are accessed are preprocessed, the output of any classification or clustering algorithms may be skewed.
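To make the vector space representation concrete, the following minimal sketch turns page text into term-frequency vectors and compares them with cosine similarity. The tokenization rule, the raw term-frequency weighting, and the sample page texts are assumptions chosen for brevity; practical systems typically add stop-word removal and term weighting.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Represent a page as a bag-of-words term-frequency vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


page_text = {
    "A": "Digital cameras and camera lenses",
    "B": "Camera lenses and tripods",
    "C": "Running shoes",
}
vectors = {name: term_vector(text) for name, text in page_text.items()}
print(cosine_similarity(vectors["A"], vectors["B"]))
print(cosine_similarity(vectors["A"], vectors["C"]))
```

Pages A and B share several terms and so score higher than A and C, which share none.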
3.1.3 Structure Preprocessing

The structure of a site is created by the hypertext links between page views. The structure can be obtained and preprocessed in the same manner as the content of a site. Again, dynamic content (and therefore links) poses more problems than static page views. A different site structure may have to be constructed for each server session.
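For static pages, the hyperlink structure can be recovered by parsing the HTML, as in the following minimal sketch based on Python's standard `html.parser` module. The page contents and the names `LinkExtractor` and `build_site_graph` are hypothetical, and the sketch does not resolve relative URLs or handle dynamically generated links.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags in one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def build_site_graph(pages):
    """Map each page name to the list of pages it links to."""
    graph = {}
    for name, html in pages.items():
        parser = LinkExtractor()
        parser.feed(html)
        graph[name] = parser.links
    return graph


site = {
    "A.html": '<html><body><a href="B.html">B</a> <a href="C.html">C</a></body></html>',
    "B.html": '<html><body><a href="F.html">F</a></body></html>',
}
site_graph = build_site_graph(site)
print(site_graph)
```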
3.2 Pattern Discovery
Pattern discovery draws upon methods and algorithms developed from several fields such as statistics, data mining, machine learning and pattern recognition. However, it is not the intent of this paper to describe all the available algorithms and techniques derived from these fields. Interested readers should consult references such as [33; 24]. This section describes the kinds of mining activities that have been applied to the Web domain. Methods developed from other fields must take into consideration the different kinds of data abstractions and prior knowledge available for Web Mining. For example, in association rule discovery, the notion of a transaction for market-basket analysis does not take into consideration the order in which items are selected. However, in Web Usage Mining, a server session is an ordered sequence of pages requested by a user. Furthermore, due to the difficulty in identifying unique sessions, additional prior knowledge is required (such as imposing a default timeout period, as was pointed out in the previous section).
3.2.1 Statistical Analysis

Statistical techniques are the most common method to extract knowledge about visitors to a Web site. By analyzing the session file, one can perform different kinds of descriptive statistical analyses (frequency, mean, median, etc.) on variables such as page views, viewing time and length of a navigational path. Many Web traffic analysis tools produce a periodic report containing statistical information such as the most frequently accessed pages, average view time of a page or average length of a path through a site. This report may include limited low-level error analysis such as detecting unauthorized entry points or finding the most common invalid URI. Despite lacking in the depth of its analysis, this type of knowledge can be potentially useful for improving the system performance, enhancing the security of the system, facilitating the site modification task, and providing support for marketing decisions.
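A report of this kind might be computed along the following lines. This is a minimal sketch, assuming each session is simply a list of page names; the function name and sample sessions are hypothetical, and view-time statistics are omitted because they would require timestamps.

```python
from collections import Counter
from statistics import mean, median


def session_statistics(sessions, top_n=3):
    """Summarize page frequencies and navigation path lengths."""
    page_counts = Counter(page for s in sessions for page in s)
    lengths = [len(s) for s in sessions]
    return {
        "top_pages": page_counts.most_common(top_n),
        "mean_path_length": mean(lengths),
        "median_path_length": median(lengths),
    }


stat_sessions = [["A", "B", "F", "O", "G"], ["L", "R"], ["A", "B", "C", "J"]]
stats = session_statistics(stat_sessions, top_n=2)
print(stats)
```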
3.2.2 Association Rules
Association rule generation can be used to relate pages that are most often referenced together in a single server session. In the context of Web Usage Mining, association rules refer to sets of pages that are accessed together with a support value exceeding some specified threshold. These pages may not be directly connected to one another via hyperlinks. For example, association rule discovery using the Apriori algorithm [18] (or one of its variants) may reveal a correlation between users who visited a page containing electronic products and those who access a page about sporting equipment. Aside from being applicable for business and marketing applications, the presence or absence of such rules can help Web designers to restructure their Web site. The association rules may also serve as a heuristic for prefetching documents in order to reduce user-perceived latency when loading a page from a remote site.
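The support and confidence measures behind such rules can be illustrated with the sketch below. It considers page pairs only, so it is the two-item special case of frequent itemset mining rather than the full Apriori algorithm; the thresholds, function name, and sample sessions are assumptions.

```python
from collections import Counter
from itertools import combinations


def pair_rules(sessions, min_support=0.5, min_confidence=0.6):
    """Return sorted rules (lhs, rhs, support, confidence) for page pairs."""
    n = len(sessions)
    page_sets = [set(s) for s in sessions]  # order within a session is ignored
    page_count = Counter(p for pages in page_sets for p in pages)
    pair_count = Counter(
        pair for pages in page_sets for pair in combinations(sorted(pages), 2)
    )
    rules = []
    for (x, y), count in pair_count.items():
        support = count / n
        if support < min_support:
            continue
        for lhs, rhs in ((x, y), (y, x)):
            confidence = count / page_count[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


shop_sessions = [
    ["electronics", "sports", "checkout"],
    ["electronics", "sports"],
    ["electronics", "home"],
    ["sports", "checkout"],
]
rules = pair_rules(shop_sessions)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

Support is the fraction of sessions containing both pages; confidence is that count divided by the number of sessions containing the left-hand page.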
3.2.3 Clustering
Clustering is a technique to group together a set of items having similar characteristics. In the Web Usage domain, there are two kinds of interesting clusters to be discovered: usage clusters and page clusters. Clustering of users tends to establish groups of users exhibiting similar browsing patterns. Such knowledge is especially useful for inferring user demographics in order to perform market segmentation in E-commerce applications or provide personalized Web content to the users. On the other hand, clustering of pages will discover groups of pages having related content. This information is useful for Internet search engines and Web assistance providers. In both applications, permanent or dynamic HTML pages can be created that suggest related hyperlinks to the user according to the user's query or past history of information needs.
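As one simple illustration of usage clustering, the sketch below groups sessions with a single-pass "leader" scheme based on Jaccard similarity of the visited page sets. The choice of algorithm, the similarity threshold, and the sample sessions are all assumptions; the projects surveyed here use a variety of other clustering methods.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of pages."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def leader_clusters(sessions, threshold=0.5):
    """Assign each session to the first cluster whose leader is similar enough."""
    clusters = []  # list of (leader_pages, member_indices)
    for index, session in enumerate(sessions):
        pages = set(session)
        for leader, members in clusters:
            if jaccard(pages, leader) >= threshold:
                members.append(index)
                break
        else:
            # No existing leader is close enough: this session starts a cluster.
            clusters.append((pages, [index]))
    return [members for _, members in clusters]


browse_sessions = [["A", "B", "F"], ["L", "R"], ["A", "B", "C"], ["L", "R", "M"]]
usage_clusters = leader_clusters(browse_sessions)
print(usage_clusters)
```

The result depends on the order in which sessions are processed, which is a known limitation of leader-style clustering.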
3.2.4 Classification
Classification is the task of mapping a data item into one of several predefined classes [33]. In the Web domain, one is interested in developing a profile of users belonging to a particular class or category. This requires extraction and selection of features that best describe the properties of a given class or category. Classification can be done by using supervised inductive learning algorithms such as decision tree classifiers, naive Bayesian classifiers, k-nearest neighbor classifiers, Support Vector Machines etc. For example, classification on server logs may lead to the discovery of interesting rules such as: 30% of users who placed an online order in /Product/Music are in the 18-25 age group and live on the West Coast.
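Of the algorithms listed above, the k-nearest neighbor classifier is the shortest to sketch. In the example below, the two features (pages viewed and minutes on site), the class labels, and the training data are entirely hypothetical and serve only to show the mechanics.

```python
import math
from collections import Counter


def knn_classify(training, features, k=3):
    """Predict the majority label among the k nearest training examples."""
    by_distance = sorted(training, key=lambda ex: math.dist(ex[0], features))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# (pages viewed, minutes on site) -> visitor class; values are made up.
training = [
    ((2, 1.0), "casual"), ((3, 2.0), "casual"), ((4, 1.5), "casual"),
    ((12, 9.0), "buyer"), ((15, 11.0), "buyer"), ((10, 8.0), "buyer"),
]
print(knn_classify(training, (11, 10.0)))
print(knn_classify(training, (3, 1.0)))
```

In practice the features would be scaled before computing distances, since raw Euclidean distance is dominated by whichever feature has the largest range.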
3.2.5 Sequential Patterns
The technique of sequential pattern discovery attempts to find inter-session patterns such that the presence of a set of items is followed by another item in a time-ordered set of sessions or episodes. By using this approach, Web marketers can predict future visit patterns which will be helpful in placing advertisements aimed at certain user groups. Other types of temporal analysis that can be performed on sequential patterns include trend analysis, change point detection, or similarity analysis.
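A much reduced form of sequential pattern discovery is sketched below: it finds ordered pairs (a, b) such that a is followed, not necessarily immediately, by b in a sufficient fraction of the input sequences. Restricting patterns to length two, the support threshold, and the sample sequences are assumptions made to keep the example short.

```python
from collections import Counter


def sequential_pairs(sequences, min_support=0.5):
    """Return {(a, b): support} for pairs where a precedes b often enough."""
    counts = Counter()
    for sequence in sequences:
        seen_pairs = set()
        for i, a in enumerate(sequence):
            for b in sequence[i + 1:]:
                if a != b:
                    seen_pairs.add((a, b))
        counts.update(seen_pairs)  # count each pair once per sequence
    n = len(sequences)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


# Each sequence is one user's time-ordered list of items (hypothetical data).
visit_sequences = [["A", "B", "C"], ["A", "C"], ["B", "A", "C"], ["C", "A"]]
patterns = sequential_pairs(visit_sequences)
print(patterns)
```

Unlike the association rules of the previous section, the pair (A, C) is distinct from (C, A) because order matters.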
3.2.6 Dependency Modeling
Dependency modeling is another useful pattern discovery task in Web Mining. The goal here is to develop a model capable of representing significant dependencies among the various variables in the Web domain. As an example, one may be interested to build a model representing the different stages a visitor undergoes while shopping in an online store based on the actions chosen (i.e. from a casual visitor to a serious potential buyer). There are several probabilistic learning techniques that can be employed to model the browsing behavior of users. Such techniques include Hidden Markov Models and Bayesian Belief Networks. Modeling of Web usage patterns will not only provide a theoretical framework for analyzing the behavior of users but is potentially useful for predicting future Web resource consumption. Such information may help develop strategies to increase the sales of products offered by the Web site or improve the navigational convenience of users.
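As a simpler relative of the models named above, a first-order Markov chain over page views can be estimated directly from sessions. The sketch below is not a Hidden Markov Model or a Bayesian Belief Network, since every state is directly observed; the page names and sessions are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(sessions):
    """Estimate P(next page | current page) from consecutive page views."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current, following in zip(session, session[1:]):
            counts[current][following] += 1
    return {
        page: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
        for page, nexts in counts.items()
    }


store_sessions = [
    ["home", "product", "cart", "checkout"],
    ["home", "product", "home"],
    ["home", "search", "product", "cart"],
]
model = transition_probabilities(store_sessions)
print(model["product"])
```

Such a table gives, for instance, the estimated probability that a visitor viewing a product page moves on to the cart.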
3.3 Pattern Analysis
Pattern analysis is the last step in the overall Web Usage mining process as described in Figure 1. The motivation behind pattern analysis is to filter out uninteresting rules or patterns from the set found in the pattern discovery phase. The exact analysis methodology is usually governed by the application for which Web mining is done. The most common form of pattern analysis consists of a knowledge query mechanism such as SQL. Another method is to load usage data into a data cube in order to perform OLAP operations. Visualization techniques, such as graphing patterns or assigning colors to different values, can often highlight overall patterns or trends in the data. Content and structure information can be used to filter out patterns containing pages of a certain usage type, content type, or pages that match a certain hyperlink structure.
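A query mechanism of this kind can be illustrated with Python's standard `sqlite3` module. In the sketch below, the table layout, the thresholds, and the rule values are hypothetical; the point is only that discovered rules stored in a relational table can be filtered with an ordinary SQL query.

```python
import sqlite3


def interesting_rules(rules, min_support, min_confidence):
    """Load rules into an in-memory table and keep those above both thresholds."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE rules (lhs TEXT, rhs TEXT, support REAL, confidence REAL)"
    )
    conn.executemany("INSERT INTO rules VALUES (?, ?, ?, ?)", rules)
    rows = conn.execute(
        "SELECT lhs, rhs, support, confidence FROM rules "
        "WHERE support >= ? AND confidence >= ? "
        "ORDER BY confidence DESC, support DESC",
        (min_support, min_confidence),
    ).fetchall()
    conn.close()
    return rows


discovered = [
    ("A.html", "B.html", 0.40, 0.90),
    ("A.html", "C.html", 0.05, 0.95),
    ("B.html", "F.html", 0.30, 0.50),
    ("L.html", "R.html", 0.20, 0.80),
]
kept = interesting_rules(discovered, min_support=0.1, min_confidence=0.7)
print(kept)
```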
4. TAXONOMY AND PROJECT SURVEY
Since 1996 there have been several research projects and commercial products that have analyzed Web usage data for a number of different purposes. This section describes the dimensions and application areas that can be used to classify Web Usage Mining projects.
4.1 Taxonomy Dimensions
While many candidate dimensions could be used to classify Web Usage Mining projects, there are five major dimensions that apply to every project: the data sources used to gather input, the types of input data, the number of users represented in each data set, the number of Web sites represented in each data set, and the application area focused on by the project. Usage data can either be gathered at the server level, proxy level, or client level, as discussed in Section 2.1. As shown in Figure 3, most projects make use of server side data. All projects analyze usage data and some also make use of content, structure, or profile data. The algorithms for a project can be designed to work on inputs representing one or many users and one or many Web sites. Single user projects are generally involved in the personalization application area. The projects that provide multi-site analysis use either client or proxy level input data in order to easily access usage data from more than one Web site. Most Web Usage Mining projects take single-site, multi-user, server-side usage data (Web server logs) as input.
4.2 Project Survey
As shown in Figures 3 and 4, usage patterns extracted from Web data have been applied to a wide range of applications. Projects such as [31; 55; 56; 58; 53] have focused on Web Usage Mining in general, without extensive tailoring of the process towards one of the various sub-categories. The WebSIFT project is discussed in more detail in the next section. Chen et al. [25] introduced the concept of maximal forward reference to characterize user episodes for the mining of traversal patterns. A maximal forward reference is the sequence of pages requested by a user up to the last page before backtracking occurs during a particular server session. The SpeedTracer project [56] from IBM Watson is built on the work originally reported in [25]. In addition to episode identification, SpeedTracer makes use of referrer and agent information in the preprocessing routines to identify users and server sessions in the absence of additional client side information. The Web Utilization Miner (WUM) system [55] provides a robust mining language in order to specify characteristics of discovered frequent paths that are interesting to the analyst. In their approach, individual navigation paths, called trails, are combined into an aggregated tree structure. Queries can be answered by mapping them into the intermediate nodes of the tree structure. Han et al. [58] have loaded Web server logs into a data cube structure in order to perform data mining as well as On-Line Analytical Processing (OLAP) activities such as roll-up and drill-down of the data. Their WebLogMiner system has been used to discover association rules, perform classification and time-series analysis (such as event sequence analysis, transition analysis and trend analysis). Shahabi et al. [53; 59] have one of the few Web Usage mining systems that relies on client side data collection. The client side agent sends back page request and time information to the server every time a page containing the Java applet (either a new page or a previously cached page) is loaded or destroyed.
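The notion of a maximal forward reference defined earlier in this section can be sketched as follows. This is an illustrative reading of that definition, not necessarily the exact procedure of [25]: a request for a page already on the current path is treated as backtracking, and the traversal used here is hypothetical.

```python
def maximal_forward_references(session):
    """Split a page sequence into maximal forward references."""
    references = []
    path = []
    moving_forward = False
    for page in session:
        if page in path:
            # Backward move: emit the path if the previous move was forward.
            if moving_forward:
                references.append(list(path))
            # Truncate the path back to the revisited page.
            del path[path.index(page) + 1:]
            moving_forward = False
        else:
            path.append(page)
            moving_forward = True
    if moving_forward:
        references.append(list(path))
    return references


traversal = list("ABCDCBEGHGWAOUOV")
forward_refs = maximal_forward_references(traversal)
print(["".join(ref) for ref in forward_refs])
```

Each emitted reference ends at the last page reached before the user turned back, so consecutive backward moves do not produce additional references.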
4.2.1 Personalization
Personalizing the Web experience for a user is the holy grail of many Web-based applications, e.g. individualized marketing for e-commerce [4]. Making dynamic recommendations to a Web user, based on her/his profile in addition to usage behavior, is very attractive to many applications, e.g. cross-sales and up-sales in e-commerce. Web usage mining is an excellent approach for achieving this goal, as illustrated in [43]. Existing recommendation systems, such as [8; 6], do not currently use data mining for recommendations, though there have been some recent proposals [16].
The WebWat her [37, SiteHelper [45, Letizia [39, and lustering work by Mobasher et. al. [43 and Yan et. al. [57
have all on entrated on providing Web Site personalization
based on usage information. Web server logs were used by
Yan et. al. [57 to dis over lusters of users having similar a ess patterns. The system proposed in [57 onsists
of an oine module that will perform luster analysis and
an online module whi h is responsible for dynami link generation of Web pages. Every site user will be assigned to
a single luster based on their urrent traversal pattern.
The links that are presented to a given user are dynami ally sele ted based on what pages other users assigned to
the same luster have visited. The SiteHelper proje t learns
a users preferen es by looking at the page a esses for ea h
user. A list of keywords from pages that a user has spent
a signi ant amount of time viewing is ompiled and presented to the user. Based on feedba k about the keyword
list, re ommendations for other pages within the site are
made. WebWat her \follows" a user as he or she browses
the Web and identies links that are potentially interesting
to the user. The WebWat her starts with a short des ription of a users interest. Ea h page request is routed through
the WebWat her proxy server in order to easily tra k the
SIGKDD, Jan 2000.
Figure 3 columns: Project; Application Focus; Data Source (Server, Proxy, Client);
Data Type (Structure, Content, Usage, Profile); User (Single, Multi); Site (Single, Multi).

Project                            Application Focus
WebSIFT (CTS99)                    General
SpeedTracer (WYB98, CPY96)         General
WUM (SF98)                         General
Shahabi (SZAS97, ZASS97)           General
Site Helper (NW97)                 Personalization
Letizia (Lie95)                    Personalization
Web Watcher (JFM97)                Personalization
Krishnapuram (NKJ99)               Personalization
Analog (YJGD96)                    Personalization
Mobasher (MCS99)                   Personalization
Tuzhilin (PT98)                    Business
SurfAid                            Business
Buchner (BM98)                     Business
WebTrends, Hitlist, Accrue, etc.   Business
WebLogMiner (ZXH98)                Business
PageGather, SCML (PE98, PE99)      Site Modification
Manley (Man97)                     Characterization
Arlitt (AW96)                      Characterization
Pitkow (PIT97, PIT98)              Characterization
Almeida (ABC96)                    Characterization
Rexford (CKR98)                    System Improve.
Schechter (SKS98)                  System Improve.
Aggarwal (AY97)                    System Improve.
Figure 3: Web Usage Mining Research Projects and Products
Web Usage Mining
  Personalization: Site Helper, Letizia, Web Watcher, Mobasher, Analog, Krishnapuram
  System Improvement: Rexford, Schechter, Aggarwal
  Site Modification: Adaptive Sites
  Business Intelligence: SurfAid, Buchner, Tuzhilin
  Usage Characterization: Pitkow, Arlitt, Manley, Almeida
  Other projects shown: WebSIFT, WUM, SpeedTracer, WebLogMiner, Shahabi
Figure 4: Major Application Areas for Web Usage Mining
user session across multiple Web sites and mark any interesting links. WebWatcher
learns based on the particular
user's browsing plus the browsing of other users with similar interests. Letizia is a
client side agent that searches
the Web for pages similar to ones that the user has already
viewed or bookmarked. The page recommendations in [43]
are based on clusters of pages found from the server log for a
site. The system recommends pages from clusters that most
closely match the current session. Pages that have not been
viewed and are not directly linked from the current page are
recommended to the user. Nasraoui et al. [44] attempt to cluster user sessions using a
fuzzy clustering algorithm, which allows a page
or user to be assigned to more than one cluster.
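The recommendation rule described for [43] (match the current session against page clusters, then suggest pages that are unvisited and not directly linked from the current page) can be sketched as follows. This is only an illustration: the match score used here (fraction of a cluster already visited), the `min_match` threshold, and all names are assumptions, not the scoring used in [43].

```python
from typing import Dict, List, Set


def recommend(session: List[str],
              clusters: List[Set[str]],
              links: Dict[str, Set[str]],
              min_match: float = 0.5) -> List[str]:
    """Recommend unvisited, not-directly-linked pages from matching clusters."""
    if not session:
        return []
    visited = set(session)
    linked = links.get(session[-1], set())  # pages linked from the current page
    scores: Dict[str, float] = {}
    for cluster in clusters:
        if not cluster:
            continue
        # Assumed match score: share of the cluster's pages seen in this session.
        match = len(cluster & visited) / len(cluster)
        if match < min_match:
            continue
        for page in cluster - visited - linked:
            scores[page] = max(scores.get(page, 0.0), match)
    # Best-matching clusters first; ties broken alphabetically.
    return sorted(scores, key=lambda p: (-scores[p], p))


clusters = [{"A", "B", "C", "D"}, {"X", "Y"}]
links = {"B": {"C"}}
print(recommend(["A", "B"], clusters, links))
```

In this example page C is skipped because it is already one click away from the current page B, so only D is recommended.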
4.2.2 System Improvement

Performance and other service quality attributes are crucial
to user satisfaction from services such as databases, networks, etc. Users of Web
services expect similar qualities. Web usage mining provides the key to understanding
Web traffic behavior, which can in turn be used
for developing policies for Web caching, network transmission [27], load balancing, or
data distribution. Security is an
acutely growing concern for Web-based services, especially
as electronic commerce continues to grow at an exponential rate [32]. Web usage mining
can also provide patterns
which are useful for detecting intrusion, fraud, attempted
break-ins, etc.
Almeida et al. [19] propose models for predicting the locality, both temporal as well
as spatial, amongst Web pages
requested from a particular user or a group of users accessing from the same proxy
server. The locality measure can
then be used for deciding pre-fetching and caching strategies
for the proxy server. The increasing use of dynamic content
has reduced the benefits of caching at both the client and
server level. Schechter et al. [52] have developed algorithms
for creating path profiles from data contained in server logs.
These profiles are then used to pre-generate dynamic HTML
pages based on the current user profile in order to reduce latency due to page
generation. Using proxy information for
pre-fetching pages has also been studied in [27] and [17].
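As a rough illustration of how path profiles can drive pre-generation, the sketch below counts which page follows each short path observed in logged sessions and predicts the next request from the longest matching recent path. It is a simplified stand-in, not the algorithm of [52]; the function names and the `max_len` cutoff are assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

Profiles = Dict[Tuple[str, ...], Counter]


def build_path_profiles(sessions: List[List[str]], max_len: int = 3) -> Profiles:
    """Count which page follows each observed path of up to max_len pages."""
    profiles: Profiles = defaultdict(Counter)
    for session in sessions:
        for i in range(1, len(session)):
            for n in range(1, min(max_len, i) + 1):
                profiles[tuple(session[i - n:i])][session[i]] += 1
    return profiles


def predict_next(profiles: Profiles, recent: List[str],
                 max_len: int = 3) -> Optional[str]:
    """Predict the next page using the longest recent path seen in the logs."""
    for n in range(min(max_len, len(recent)), 0, -1):
        followers = profiles.get(tuple(recent[-n:]))
        if followers:
            return followers.most_common(1)[0][0]
    return None


sessions = [["A", "B", "C"], ["A", "B", "C"], ["D", "B", "E"]]
profiles = build_path_profiles(sessions)
print(predict_next(profiles, ["D", "B"]), predict_next(profiles, ["B"]))
```

The longer context matters: after B alone the most frequent follower is C, but after the path D, B the logs point to E, which is the page a server would pre-generate.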
4.2.3 Site Modification
The attractiveness of a Web site, in terms of both content
and structure, is crucial to many applications, e.g. a product
catalog for e-commerce. Web usage mining provides detailed
feedback on user behavior, providing the Web site designer
information on which to base redesign decisions.
While the results of any of the projects could lead to redesigning the structure and
content of a site, the adaptive
Web site project (SCML algorithm) [48; 49] focuses on automatically changing the
structure of a site based on usage
patterns discovered from server logs. Clustering of pages is
used to determine which pages should be directly linked.
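The idea of using usage data to decide which pages deserve a direct link can be sketched with simple co-occurrence counting. This is a deliberately reduced illustration, not the clustering used in [48; 49]: it counts how many sessions visit both pages of a pair and proposes a link when the pair is frequent but not yet linked. The names and the `min_sessions` threshold are assumptions.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Set, Tuple


def suggest_links(sessions: List[List[str]],
                  links: Dict[str, Set[str]],
                  min_sessions: int = 2) -> List[Tuple[str, str, int]]:
    """Propose links between pages often visited together but not yet linked."""
    pair_counts: Counter = Counter()
    for session in sessions:
        # Count each unordered page pair once per session.
        for a, b in combinations(sorted(set(session)), 2):
            pair_counts[(a, b)] += 1
    suggestions = []
    for (a, b), count in pair_counts.items():
        already = b in links.get(a, set()) or a in links.get(b, set())
        if count >= min_sessions and not already:
            suggestions.append((a, b, count))
    return sorted(suggestions, key=lambda s: (-s[2], s[0], s[1]))


sessions = [["home", "news", "jobs"], ["home", "jobs"], ["news", "jobs", "home"]]
links = {"home": {"news", "jobs"}}
print(suggest_links(sessions, links))
```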
4.2.4 Business Intelligence
Information on how customers are using a Web site is critical
for marketers of e-tailing businesses. Buchner
et al. [22] have presented a knowledge discovery process in order to discover marketing
intelligence from Web data. They
define a Web log data hypercube that will consolidate Web
usage data along with marketing data for e-commerce applications. They identified four
distinct steps in the customer relationship life cycle that can be supported by their
knowledge
discovery techniques: customer attraction, customer retention, cross sales and customer
departure. There are several
commercial products, such as SurfAid [11], Accrue [1], NetGenesis [7], Aria [3],
Hitlist [5], and WebTrends [13] that
provide Web traffic analysis mainly for the purpose of gathering business intelligence.
Accrue, NetGenesis, and Aria
are designed to analyze e-commerce events such as products bought and advertisement
click-through rates in addition to straightforward usage statistics. Accrue provides a
path analysis visualization tool and IBM's SurfAid provides
OLAP through a data cube and clustering of users in addition to page view statistics.
Padmanabhan et al. [46] use
Web server logs to generate beliefs about the access patterns
of Web pages at a given Web site. Algorithms for finding interesting rules based on the
unexpectedness of the rule were
also developed.
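A data cube over Web log records can be pictured as page-view counts pre-aggregated for every combination of dimensions, so that roll-ups such as "views of a page per campaign" become lookups. The sketch below is a minimal illustration under assumed dimension names (`page`, `region`, `campaign`); the actual hypercube schema of [22] and the SurfAid cube are not described in this survey.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Sequence


def build_cube(records: List[Dict[str, str]],
               dimensions: Sequence[str]) -> Counter:
    """Count records for every subset of dimensions (all roll-up levels)."""
    cube: Counter = Counter()
    for rec in records:
        for r in range(len(dimensions) + 1):
            for dims in combinations(dimensions, r):
                # A cell is keyed by its (dimension, value) pairs, in dimension order.
                cube[tuple((d, rec[d]) for d in dims)] += 1
    return cube


records = [
    {"page": "/cart", "region": "EU", "campaign": "spring"},
    {"page": "/cart", "region": "US", "campaign": "spring"},
    {"page": "/home", "region": "EU", "campaign": "none"},
]
cube = build_cube(records, ["page", "region", "campaign"])
print(cube[()], cube[(("page", "/cart"),)])
```

Materializing every subset is exponential in the number of dimensions, so this form only suits a handful of dimensions; it is meant to show the roll-up idea, not a production design.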
4.2.5 Usage Characterization

While most projects that work on characterizing the usage,
content, and structure of the Web don't necessarily consider themselves to be engaged
in data mining, there is a
large amount of overlap between Web characterization research and Web Usage mining.
Catledge et al. [23] discuss
the results of a study conducted at the Georgia Institute of
Technology, in which the Web browser Xmosaic was modified to log client side activity.
The results collected provide
detailed information about the user's interaction with the
browser interface as well as the navigational strategy used to
browse a particular site. The project also provides detailed
statistics about occurrence of the various client side events
such as clicking the back/forward buttons, saving a file,
adding to bookmarks etc. Pitkow et al. [36] propose a model
which can be used to predict the probability distribution for
various pages a user might visit on a given site. This model
works by assigning a value to all the pages on a site based on
various attributes of that page. The formulas and threshold values used in the model
are derived from an extensive
empirical study carried out on various browsing communities and their browsing
patterns. Arlitt et al. [20] discuss
various performance metrics for Web servers along with details about the relationship
between each of these metrics
for different workloads. Manley [40] develops a technique
for generating a custom made benchmark for a given site
based on its current workload. This benchmark, which he
calls a self-configuring benchmark, can be used to perform
scalability and load balancing studies on a Web server. Chi
et al. [35] describe a system called WEEV (Web Ecology
and Evolution Visualization) which is a visualization tool to
study the evolving relationship of web usage, content and
site topology with respect to time.
5. WEBSIFT OVERVIEW
The WebSIFT system [31] is designed to perform Web Usage Mining from server logs in the
extended NCSA format
(includes referrer and agent fields). The preprocessing algorithms include identifying
users, server sessions, and inferring cached page references through the use of the
referrer
field. The details of the algorithms used for these steps are
contained in [30]. In addition to creating a server session
file, the WebSIFT system performs content and structure
preprocessing, and provides the option to convert server sessions into episodes. Each
episode is either the subset of all
content pages in a server session, or all of the navigation
pages up to and including each content page. Several algorithms for identifying
episodes (referred to as transactions
in the paper) are described and evaluated in [28].
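The two episode definitions above can be made concrete with a short sketch. It assumes the page classification (content versus navigation) is already available as a set of content pages; how WebSIFT derives that classification, and the algorithms evaluated in [28], are not reproduced here. Dropping trailing navigation pages that never reach a content page is a choice made for this example.

```python
from typing import List, Set


def content_only_episode(session: List[str], content_pages: Set[str]) -> List[str]:
    """Episode made of just the content pages in a server session."""
    return [page for page in session if page in content_pages]


def navigation_content_episodes(session: List[str],
                                content_pages: Set[str]) -> List[List[str]]:
    """One episode per content page: the navigation pages leading to it, plus it."""
    episodes: List[List[str]] = []
    current: List[str] = []
    for page in session:
        current.append(page)
        if page in content_pages:
            episodes.append(current)
            current = []
    # Assumption: navigation pages after the last content page are discarded.
    return episodes


session = ["A", "B", "C", "D", "E", "F"]
content_pages = {"C", "E"}
print(content_only_episode(session, content_pages))
print(navigation_content_episodes(session, content_pages))
```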
The server session or episode files can be run through sequential pattern analysis,
association rule discovery, clustering, or general statistics algorithms, as shown in
Figure 5.
The results of the various knowledge discovery tools can be
analyzed through a simple knowledge query mechanism, a
visualization tool (association rule map with confidence and
support weighted edges), or the information filter (OLAP
tools such as a data cube are possible as shown in Figure 5,
but are not currently implemented). The information filter
makes use of the preprocessed content and structure information to automatically filter
the results of the knowledge
discovery algorithms for patterns that are potentially interesting. For example, usage
clusters that contain page views
from multiple content clusters are potentially interesting,
whereas usage clusters that match content clusters may not
be interesting. The details of the method the information filter uses to combine and
compare evidence from the different
data sources are contained in [31].
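The usage-cluster example above can be sketched as a simple filter: a usage cluster is flagged when its pages fall into more than one content cluster. This only illustrates the example in the text; the evidence-combination method of [31] is more general, and the mapping `content_cluster_of` and the function name are assumptions.

```python
from typing import Dict, List, Set, Tuple


def flag_cross_topic_clusters(usage_clusters: List[Set[str]],
                              content_cluster_of: Dict[str, str]
                              ) -> List[Tuple[List[str], List[str]]]:
    """Return usage clusters whose pages span more than one content cluster."""
    flagged = []
    for cluster in usage_clusters:
        topics = {content_cluster_of[p] for p in cluster if p in content_cluster_of}
        if len(topics) > 1:
            flagged.append((sorted(cluster), sorted(topics)))
    return flagged


content_cluster_of = {"/jobs/a": "jobs", "/jobs/b": "jobs", "/news/x": "news"}
usage_clusters = [{"/jobs/a", "/jobs/b"}, {"/jobs/a", "/news/x"}]
print(flag_cross_topic_clusters(usage_clusters, content_cluster_of))
```

The first usage cluster merely mirrors the "jobs" content cluster and is filtered out; the second links two topics that the site's content alone would not have suggested.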
6. PRIVACY ISSUES
Privacy is a sensitive topic which has been attracting a lot of
attention recently due to rapid growth of e-commerce. It is
further complicated by the global and self-regulatory nature
of the Web. The issue of privacy revolves around the fact
that most users want to maintain strict anonymity on the
Web. They are extremely averse to the idea that someone is
monitoring the Web sites they visit and the time they spend
on those sites.
On the other hand, site administrators are interested in finding out the demographics
of users as well as the usage statistics of different sections of their Web site. This
information
would allow them to improve the design of the Web site and
would ensure that the content caters to the largest population of users visiting their
site. The site administrators
also want the ability to identify a user uniquely every time
she visits the site, in order to personalize the Web site and
improve the browsing experience.
The main challenge is to come up with guidelines and rules
such that site administrators can perform various analyses
on the usage data without compromising the identity of an
individual user. Furthermore, there should be strict regulations to prevent the usage
data from being exchanged/sold
to other sites. The users should be made aware of the privacy policies followed by any
given site, so that they can
make an informed decision about revealing their personal
data. The success of any such guidelines can only be guaranteed if they are backed up
by a legal framework.
The W3C has an ongoing initiative called Platform for Privacy Preferences (P3P)
[10; 38]. P3P provides a protocol
which allows the site administrators to publish the privacy
policies followed by a site in a machine readable format.
When the user visits the site for the first time the browser
reads the privacy policies followed by the site and then compares them with the
security settings configured by the user.
If the policies are satisfactory the browser continues requesting pages from the site,
otherwise a negotiation protocol is
used to arrive at a setting which is acceptable to the user.
Another aim of P3P is to provide guidelines for independent
organizations which can ensure that sites comply with the
policy statement they are publishing [12].
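The comparison step, in which the browser checks a site's published policy against the user's settings, can be sketched as below. The data categories and purposes are invented for this example and are not the P3P vocabulary; the sketch only shows the decision of whether to continue or fall back to negotiation.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representation: data category -> purposes it may be used for.
Policy = Dict[str, Set[str]]


def unacceptable_practices(site_policy: Policy,
                           user_prefs: Policy) -> List[Tuple[str, str]]:
    """(category, purpose) pairs the site declares but the user has not allowed."""
    violations = []
    for category, purposes in site_policy.items():
        allowed = user_prefs.get(category, set())
        violations.extend((category, purpose) for purpose in purposes - allowed)
    return sorted(violations)


site_policy = {"clickstream": {"site-analysis", "marketing"}, "email": {"contact"}}
user_prefs = {"clickstream": {"site-analysis"}, "email": {"contact"}}

problems = unacceptable_practices(site_policy, user_prefs)
# An empty list means the browser keeps requesting pages; otherwise negotiate.
print("negotiate" if problems else "continue", problems)
```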

The European Union has taken a lead in setting up a regulatory framework for Internet
Privacy and has issued a directive which sets guidelines for processing and transfer of
personal data [15]. Unfortunately, in the U.S. there is no unifying framework in place,
though the U.S. Federal Trade Commission (FTC), after a study of commercial Web sites,
has
recommended that Congress develop legislation to regulate
the personal information being collected at Web sites [26].
7. CONCLUSIONS
This paper has attempted to provide an up-to-date survey
of the rapidly growing area of Web Usage mining. With
the growth of Web-based applications, specifically electronic
commerce, there is significant interest in analyzing Web usage data to better
understand Web usage, and apply the
knowledge to better serve users. This has led to a number of
commercial offerings for doing such analysis. However, Web
Usage mining raises some hard scientific questions that must
be answered before robust tools can be developed. This article has aimed at describing
such challenges, and the hope
is that the research community will take up the challenge of
addressing them.
8. REFERENCES
[1] Accrue. http://www.accrue.com.
[2] Alladvantage. http://www.alladvantage.com.
[3] Andromedia aria. http://www.andromedia.com.
[4] Broadvision. http://www.broadvision.com.
[5] Hit list commerce. http://www.marketwave.com.
[6] Likeminds. http://www.andromedia.com.
[7] Netgenesis. http://www.netgenesis.com.
[8] Netperceptions. http://www.netperceptions.com.
[9] Netzero. http://www.netzero.com.
[10] Platform for privacy project. http://www.w3.org/P3P/.
[11] Surfaid analytics. http://surfaid.dfw.ibm.com.
[12] Truste: Building a web you can believe in. http://www.truste.org/.
[13] Webtrends log analyzer. http://www.webtrends.com.
[14] World wide web committee web usage characterization
activity. http://www.w3.org/WCA.
[15] European commission. The directive on the protection
of individuals with regard to the processing of personal data and on the free movement
of such data.
http://www2.echo.lu/, 1998.
[16] Data mining: Crossing the chasm, 1999. Invited talk at
the 5th ACM SIGKDD Int'l Conference on Knowledge
Discovery and Data Mining (KDD99).
INPUT: Site Files, Access Log, Referrer Log, Agent Log, Registration or Remote Agent Data
PREPROCESSING: Site Spider, Data Cleaning, User Identification, Session Identification,
  Path Completion, Classification Algorithm, Episode Identification
  (outputs: Server Session File, Episode File, Site Topology, Site Content, Page Classification)
PATTERN DISCOVERY: Sequential Pattern Mining, Association Rule Mining, Clustering,
  Standard Statistics Package
  (outputs: Sequential Patterns, Association Rules, Page Clusters, User Clusters, Usage Statistics)
PATTERN ANALYSIS: Information Filter, OLAP/Visualization, Knowledge Query Mechanism
  (outputs: "Interesting" Rules, Patterns, and Statistics)
Figure 5: Architecture for the WebSIFT System
[17] Charu C Aggarwal and Philip S Yu. On disk caching
of web objects in proxy servers. In CIKM 97, pages
238-245, Las Vegas, Nevada, 1997.
[18] R. Agrawal and R. Srikant. Fast algorithms for mining
association rules. In Proc. of the 20th VLDB Conference, pages 487-499, Santiago,
Chile, 1994.
[19] Virgilio Almeida, Azer Bestavros, Mark Crovella, and
Adriana de Oliveira. Characterizing reference locality
in the www. Technical Report TR-96-11, Boston University, 1996.
[20] Martin F Arlitt and Carey L Williamson. Internet web
servers: Workload characterization and performance
implications. IEEE/ACM Transactions on Networking,
5(5):631-645, 1997.
[21] M. Balabanovic and Y. Shoham. Learning information retrieval agents: Experiments
with automated
web browsing. In On-line Working Notes of the AAAI
Spring Symposium Series on Information Gathering
from Distributed, Heterogeneous Environments, 1995.
[22] Alex Buchner and Maurice D Mulvenna. Discovering
internet marketing intelligence through online analytical web usage mining. SIGMOD
Record, 27(4):54-61,
1998.
[23] L. Catledge and J. Pitkow. Characterizing browsing behaviors on the world wide
web. Computer Networks and
ISDN Systems, 27(6), 1995.
[24] M.S. Chen, J. Han, and P.S. Yu. Data mining: An
overview from a database perspective. IEEE Transactions on Knowledge and Data
Engineering, 8(6):866-883, 1996.
[25] M.S. Chen, J.S. Park, and P.S. Yu. Data mining
for path traversal patterns in a web environment. In
16th International Conference on Distributed Computing Systems, pages 385-392, 1996.
[26] Roger Clarke. Internet privacy concerns confirm the case
for intervention. Communications of the ACM, 42(2):60-67, 1999.
[27] E. Cohen, B. Krishnamurthy, and J. Rexford. Improving end-to-end performance of
the web using server
volumes and proxy filters. In Proc. ACM SIGCOMM,
pages 241-253, 1998.
[28] Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava. Grouping web page
references into transactions for mining world wide web browsing patterns. In
Knowledge and Data Engineering Workshop, pages 2-9,
Newport Beach, CA, 1997. IEEE.
[29] Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava. Web mining: Information
and pattern discovery on the world wide web. In International Conference on Tools with
Artificial Intelligence, pages 558-567, Newport Beach, 1997. IEEE.
[30] Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava. Data preparation for
mining world wide web
browsing patterns. Knowledge and Information Systems, 1(1), 1999.
[31] Robert Cooley, Pang-Ning Tan, and Jaideep Srivastava.
Discovery of interesting usage patterns from web data.
Technical Report TR 99-022, University of Minnesota,
1999.
[32] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in
behavior. In Fifth ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53-62,
San Diego, CA,
1999. ACM.
[33] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From
data mining to knowledge discovery: An overview. In
Proc. ACM KDD, 1994.
[34] David Gibson, Jon Kleinberg, and Prabhakar Raghavan. Inferring web communities
from link topology. In
Conference on Hypertext and Hypermedia. ACM, 1998.
[35] Chi E. H., Pitkow J., Mackinlay J., Pirolli P., Gossweiler, and Card S. K.
Visualizing the evolution of web
ecologies. In CHI '98, Los Angeles, California, 1998.
[36] Bernardo Huberman, Peter Pirolli, James Pitkow, and
Rajan Lukose. Strong regularities in world wide web
surfing. Technical report, Xerox PARC, 1998.
[37] T. Joachims, D. Freitag, and T. Mitchell. Webwatcher:
A tour guide for the world wide web. In The 15th International Conference on Artificial
Intelligence, Nagoya,
Japan, 1997.
[38] Reagle Joseph and Cranor Lorrie Faith. The platform
for privacy preferences. Communications of the ACM, 42(2):48-55, 1999.
[39] H. Lieberman. Letizia: An agent that assists web
browsing. In Proc. of the 1995 International Joint Conference on Artificial
Intelligence, Montreal, Canada,
1995.
[40] Stephen Lee Manley. An Analysis of Issues Facing
World Wide Web Servers. Undergraduate, Harvard,
1997.
[41] B. Masand and M. Spiliopoulou, editors. Workshop on
Web Usage Analysis and User Profiling (WebKDD),
1999.
[42] B. Mobasher, N. Jain, E. Han, and J. Srivastava. Web
mining: Pattern discovery from world wide web transactions. (TR 96-050), 1996.
[43] Bamshad Mobasher, Robert Cooley, and Jaideep Srivastava. Creating adaptive web
sites through usage-based clustering of urls. In Knowledge and Data Engineering
Workshop, 1999.
[44] Olfa Nasraoui, Raghu Krishnapuram, and Anupam
Joshi. Mining web access logs using a fuzzy relational clustering algorithm based on a
robust estimator.
In Eighth International World Wide Web Conference,
Toronto, Canada, 1999.
[45] D.S.W. Ngu and X. Wu. Sitehelper: A localized agent
that helps incremental exploration of the world wide
web. In 6th International World Wide Web Conference,
Santa Clara, CA, 1997.
[46] Balaji Padmanabhan and Alexander Tuzhilin. A belief-driven method for discovering
unexpected patterns. In
Fourth International Conference on Knowledge Discovery and Data Mining, pages 94-100,
New York, New
York, 1998.
[47] M. Pazzani, L. Nguyen, and S. Mantik. Learning from
hotlists and coldlists: Towards a www information filtering and seeking agent. In IEEE
1995 International
Conference on Tools with Artificial Intelligence, 1995.
[48] Mike Perkowitz and Oren Etzioni. Adaptive web sites:
Automatically synthesizing web pages. In Fifteenth National Conference on Artificial
Intelligence, Madison,
WI, 1998.
[49] Mike Perkowitz and Oren Etzioni. Adaptive web sites:
Conceptual cluster mining. In Sixteenth International
Joint Conference on Artificial Intelligence, Stockholm,
Sweden, 1999.
[50] Peter Pirolli, James Pitkow, and Ramana Rao. Silk
from a sow's ear: Extracting usable structures from
the web. In CHI-96, Vancouver, 1996.
[51] G. Salton and M.J. McGill. Introduction to Modern Information Retrieval.
McGraw-Hill, New York, 1983.
[52] S. Schechter, M. Krishnan, and M. D. Smith. Using
path profiles to predict http requests. In 7th International World Wide Web Conference,
Brisbane, Australia, 1998.
[53] Cyrus Shahabi, Amir M Zarkesh, Jafar Adibi, and
Vishal Shah. Knowledge discovery from users web-page
navigation. In Workshop on Research Issues in Data
Engineering, Birmingham, England, 1997.
[54] E. Spertus. Parasite: Mining structural information on
the web. Computer Networks and ISDN Systems: The
International Journal of Computer and Telecommunication Networking, 29:1205-1215, 1997.
[55] Myra Spiliopoulou and Lukas C Faulstich. Wum: A
web utilization miner. In EDBT Workshop WebDB98,
Valencia, Spain, 1998. Springer Verlag.
[56] Kun-lung Wu, Philip S Yu, and Allen Ballman. Speedtracer: A web usage mining and
analysis tool. IBM
Systems Journal, 37(1), 1998.
[57] T. Yan, M. Jacobsen, H. Garcia-Molina, and U. Dayal.
From user access patterns to dynamic hypertext linking. In Fifth International World
Wide Web Conference, Paris, France, 1996.
[58] O. R. Zaiane, M. Xin, and J. Han. Discovering web
access patterns and trends by applying olap and data
mining technology on web logs. In Advances in Digital
Libraries, pages 19-29, Santa Barbara, CA, 1998.
[59] Amir Zarkesh, Jafar Adibi, Cyrus Shahabi, Reza Sadri, and
Vishal Shah. Analysis and design of server informative www-sites. In Sixth
International Conference on Information and
Knowledge Management, Las Vegas, Nevada, 1997.

