
ONTOLOGY & CONTEXT/PERSPECTIVE-BASED WEB MINING USING WORDNET DATABASE

Abstract

The World Wide Web today provides users access to an extremely large number of Web sites, many of which contain information of educational and commercial value. Technology in the field of digital media generates huge amounts of textual information. The potential for exchange and retrieval of information is vast and daunting. The key problem in achieving efficient and user-friendly retrieval is the development of a search mechanism that guarantees delivery of minimal irrelevant information (high precision) while ensuring that relevant information is not overlooked (high recall). The traditional solution employs keyword-based search: the only documents retrieved are those containing user-specified keywords. But many documents convey the desired semantic information without containing these keywords. The quality of a search result is measured by both accuracy and completeness.

The study investigates an approach, DISCOVERY, for using (1) context/perspective information and (2) social networks such as ODP or Wikipedia for designing practical and scalable human-web systems for finding web pages that are relevant and meet the needs and requirements of a user or a group of users. The resulting system arguably meets the common needs and requirements of a group of people based on the information provided by the group in the form of a set of context web pages.

I. INTRODUCTION

The ubiquity of the web can be characterized by the enormous volume and coverage of web content, the phenomenal number of web users and businesses, the vast number of computers and devices accessing the web, and the large number of web-based applications. Users today perform more searches using web search engines than OPAC (Online Public Access Catalog) systems.

Web mining refers to the discovery of knowledge from Web data that include Web pages, media objects on the Web, Web links, Web log data, and other data generated by the usage of Web data.

Kosala and Blockeel classified Web mining into:

(a) Web content mining
(b) Web structure mining and
(c) Web usage mining [3].

• Web content mining refers to mining knowledge from Web pages and other Web objects.
• Web structure mining refers to mining knowledge about the link structure connecting Web pages and other Web objects.
• Web usage mining refers to the mining of usage patterns of Web pages found among users accessing a Website.

Among the three, Web content mining is perhaps studied most extensively due to the prior work in text mining. The traditional topics covered by Web content mining include:

• Web page classification: This involves the classification of Web pages under some pre-defined categories that may be organised in a tree or other structures.

• Web clustering: This involves the grouping of Web pages based on the similarities among them. Each resultant group should have similar Web pages while Web pages from different resultant groups should be dissimilar.

• Web extraction: This involves extracting HTML elements, term phrases, or tuples from Web pages
that represent some required concept instances, e.g., person names, location names, book records, etc.

The ubiquity of the web offers some obvious explanations, namely:

• The coverage of web content is so large that it is difficult for any traditional digital library to match;
• The ability to browse web content directly on the users' computers and the ease of downloading it are clearly a big draw; and
• The availability of web search engines (e.g. Google) and web directories (e.g. Yahoo!, DMOZ) has helped tremendously in simplifying the process of searching web content.

Nevertheless, web content is not always easy to use. Due to the unstructured and semi-structured nature of web pages and the design idiosyncrasies of web sites, it is a challenging task to develop digital libraries for organizing and managing digital content from the web.

Berners-Lee et al. therefore introduced the idea of the semantic web, which refers to the construction of a machine-understandable semantic layer over the existing web content so as to support better information processing and web services. While the semantic web may take several years to realize, digital library researchers are turning to web mining techniques to improve the accessibility of web content. The well-established web mining techniques include web classification and web extraction.

II OBJECTIVES

Web mining techniques have shown promising performance in research experiments. Their actual deployment on live web data, in contrast, has been fairly limited due to the lack of background semantics required for processing the text data, links, and other elements in web pages.

In this respect, an ontology which gives a conceptual description of the background semantics can serve as a very useful input to Web mining problems.

An ontology refers to a set of concepts and relationships, together known as ontology entities, describing the information within an application domain.

When an ontology is used in solving a web classification or extraction problem, the results obtained can be associated with the ontology entities, making them easier to understand. This is a big advantage because each ontology often represents knowledge agreed upon by users and applications of a domain.

For the purpose of this paper, we define an ontology to be a set of concepts C and relationships R. The relationships in R can be either taxonomic or non-taxonomic.

For example, Figure 1 depicts a simple University ontology consisting of a set of concepts Cuniv = {Person, Faculty, Staff, Student, Department, Project, Course}, and a set of relationships Runiv = {Department_Of(Person, Department), Member_Of(Person, Project), Instructor_Of(Course, Person), Superclass_Of(Faculty, Person), Superclass_Of(Staff, Person), Superclass_Of(Student, Person)}. Superclass_Of represents the taxonomic relationship while the rest are non-taxonomic.

With this definition, the instances of an ontology refer to the instances of its concepts and relationships. If each concept instance exists in the form of a Web page, a relationship instance will then exist in the form of a Web page pair. This view has been adopted in most of the Web classification research.

On the other hand, if each concept instance exists in the form of an HTML element, a relationship
instance will then exist in the form of an HTML element pair.

This alternative view is usually adopted in Web extraction research. It is noted that other forms or hybrid forms of concept instances may also exist for some Websites.

As the languages for defining ontologies and using them to mark up web content become well accepted, this paper describes the initiation of a systematic design research study based on prototyping/evaluation and abstraction, using existing and new techniques incorporated as plug-and-play components into a research workbench.

The study investigates an approach called DISCOVERY for using (1) context/perspective information and (2) social networks such as ODP or Wikipedia for designing practical and scalable human-web systems for finding web pages that are relevant and meet the needs and requirements of a user or a group of users.

III. DISCOVERY APPROACH AND ITS IMPLEMENTATION

3.1 Overview

A system based on the DISCOVERY approach uses the following major subsystems:

a. Context/Perspective Extractor
b. Semantic Query Constructor
c. Results Categorizer/Visualizer
d. Context/Perspective Suggestion Agent (Profile Builder)

3.1.1 Context/Perspective Extractor

This subsystem extracts context/perspective details from the information provided by the user (keywords, relevant URLs and documents). It is highly likely that all the information provided by the user for a specific query will be around a single theme, though we envision options for building customized, personal indices of searched pages. The challenge is to identify the correct theme (perspective) and present it back to the user in the shortest time possible. To help identify the theme, ODP, a human-edited hierarchy of categories (ontology) with descriptions, will be used. A number of algorithms will be explored to improve theme identification. One approach will be to create a word-topic occurrence matrix and/or a phrase-topic matrix, where a topic is one of the ODP topic categories. The user-provided URLs/documents can be matched against the word/phrase-ODP topic matrices after Singular Value Decomposition (SVD) by computing cosine similarities.

3.1.2 Semantic Query Constructor

Semantic query construction is an important part of the architecture for DISCOVERY. This process will be used to disambiguate query terms using WordNet (WordNet 2008). The industrial ontology group in Finland (http://www.cs.jyu.fi/ai/OntoGroup/InBCT_May_2004.html) has done some significant study on enhancing queries. WordNet's hypernyms can be used to provide a more generic representation of the query. Enhanced queries can also be constructed with the help of the most common keywords and phrases in the context information provided by the user and by using the keywords/phrases representative of the perspectives selected/provided by the user. Query refinement with lexicons and ontologies may be explored using a methodology called CONQUER (CONtext-aware QUERy processing).

3.1.3 Results Categorizer/Visualizer
Clustering is an unsupervised learning process, as opposed to classification. Clustering the documents against the perspectives selected by the user is an interesting problem. Some documents tend to be part of several perspectives (which may show the interrelation between the perspectives). The main future research activity here is to identify the Best Matching Unit (BMU) between the documents and between documents and perspectives. Each perspective can be characterized with a set of keywords/phrases and their respective calculated distance to the retrieved results and presented to the user.

This approach also paves the way to explore certain hidden information, based on the user's discretion. Self-Organizing Maps (SOM) can be used to cluster the result sets. The Kohonen SOM network is very effective for visualization of high dimensional data (Zhao and Ram 2004). It compresses information while preserving the most important topological and geometric relationships of the primary data elements on the display.

The main advantage is to gain insight into the (hidden) structure of data by observing the map, due to the topology-preserving nature of SOM. Bakus et al. (2002) describe the use of phrases for document clustering with SOM. Amine et al. (2008) use SOM for concept-based clustering of textual documents.

3.1.4 Context/Perspective Suggestion Agent (Profile Building)

A significant part of DISCOVERY is building user profiles by gathering perspective-related information over a period of time. The objective of this module (a type of recommender system) is to learn and suggest perspectives for different types of topics.

One approach could be to use collaborative filtering algorithms to identify the user perspectives proactively (Das et al. 2007). Once the system is put to use and data is gathered about user activities, further research can be conducted in this area of predicting user interests.

IV IMPLEMENTATION

The implementation of a system for DISCOVERY implements the first two subsystems: Context/Perspective Extractor and Semantic Query Constructor.

Figure 2 details the flow of events in the current implementation.

Figure 2: Experimental Prototype – Flow of Events

3.3 System Usage Scenarios

The following scenarios provide a sample of the ways a system implementing the DISCOVERY approach might be used.

Scenario: A civil engineer interested in making a building design, and specifically in "Home," can use the tool as follows:

• Context Profile: The user can input a keyword ("Home") and/or related documents (e.g. building construction URLs). Such related documents act as the user context.

• The system interacts with the user, presenting a set of perspectives drawn from ODP topic categories such as:

home, place: where you live at a particular time
dwelling, home, domicile, abode, habitation, dwelling house: housing that someone is living in
home: the country or state or city where you live
home: an environment offering affection and security
home, nursing home, rest home: an institution where people are cared for
base,

• Context/Perspective Selection: The user may choose among the suggested perspectives and/or may add his/her own perspective (e.g. "housing that someone is living in").
• The system disambiguates query terms using WordNet and may then semantically enhance the query using an ontology.
• The system executes the query and can classify, filter, and cluster the results based on the perspective(s) identified or suggested by the user.
• Results Visualization: The user will manipulate an easy-to-navigate interface with which to identify the result density for a particular perspective and the distance between the perspectives.
• User activities may be captured and recorded as part of an extended activity profile covering an extended period of time.

3.4 Sample Java API using WordNet

jawsTest.java

    package edu.smu.tspell.wordnet.impl.file.test;

    // JAWS types used below; they live in the edu.smu.tspell.wordnet package.
    import edu.smu.tspell.wordnet.Synset;
    import edu.smu.tspell.wordnet.WordNetDatabase;

    /**
     * Displays word forms and definitions for synsets containing the word form
     * specified on the command line. To use this application, specify the word
     * form that you wish to view synsets for, as in the following example, which
     * displays all synsets containing the word form "home":
     * <br>
     * java jawsTest home
     */
    public class jawsTest
    {
        /**
         * Main entry point. The command-line arguments are concatenated together
         * (separated by spaces) and used as the word form to look up.
         */
        public static void main(String[] args)
        {
            if (args.length > 0)
            {
                // Concatenate the command-line arguments
                StringBuffer buffer = new StringBuffer();
                for (int i = 0; i < args.length; i++)
                {
                    buffer.append((i > 0 ? " " : "") + args[i]);
                }
                String wordForm = buffer.toString();
                // Get the synsets containing the word form
                WordNetDatabase database = WordNetDatabase.getFileInstance();
                Synset[] synsets = database.getSynsets(wordForm);
                // Display the word forms and definitions for synsets retrieved
                if (synsets.length > 0)
                {
                    System.out.println("The following synsets contain '" +
                            wordForm + "' or a possible base form of that text:");
                    for (int i = 0; i < synsets.length; i++)
                    {
                        System.out.println("");
                        String[] wordForms = synsets[i].getWordForms();
                        for (int j = 0; j < wordForms.length; j++)
                        {
                            System.out.print((j > 0 ? ", " : "") + wordForms[j]);
                        }
                        System.out.println(": " + synsets[i].getDefinition());
                    }
                }
                else
                {
                    System.err.println("No synsets exist that contain " +
                            "the word form '" + wordForm + "'");
                }
            }
            else
            {
                System.err.println("You must specify " +
                        "a word form for which to retrieve synsets.");
            }
        }
    }

Output:

    D:\edu\smu\tspell\wordnet\impl\file\test>java -classpath .;C:/mywork/code/jaws.jar -Dwordnet.database.dir=d:/WordNet/2.1/dict jawsTest home

    The following synsets contain 'home' or a possible base form of that text:

    home, place: where you live at a particular time
    dwelling, home, domicile, abode, habitation, dwelling house: housing that someone is living in
    home: the country or state or city where you live
    home: an environment offering affection and security
    home, nursing home, rest home: an institution where people are cared for
    base, home: the place where you are stationed and from which missions start and end
    family, household, house, home, menage: a social unit living together
    home plate, home base, home, plate: (baseball) base consisting of a rubber slab where the batter stands; it must be touched by a base runner in order to score
    home: place where something began and flourished
    home: provide with, or send to, a home
    home: return home accurately from a long distance
    home: used of your own ground
    home: relating to or being where one lives or where one's roots are
    home: at or to or in the direction of one's home or family
    home: on or to the point aimed at
    home: to the fullest extent; to the heart
    home, interior, internal, national: inside the country

    D:\edu\smu\tspell\wordnet\impl\file\test

V. CONCLUSION

The work presented in this paper follows the design science research paradigm. Building research artifacts, evaluating them for feasibility, effectiveness, and efficiency, and abstracting the knowledge gained in terms of design principles and theories are among the important research activities in design science research. We start with a general approach, DISCOVERY, and carry out its situated implementation to explore classes of web mining tasks. Experimentation with and evaluation of the resulting artifact aim to find general design principles and mid-range web mining theories.

The results reported in the paper point to the potential of the DISCOVERY approach, following design science methodology, to advance knowledge. Future work will complete the research study based on the DISCOVERY approach. It will also develop a rigorous experimental framework for evaluating a system based on the approach (using WordNet as a source of experimental data), and generalize the system.

REFERENCES

[1] Hevner, A., March, S., Park, J. and Ram, S., "Design Science in Information Systems Research," MIS Quarterly (28:1), pp. 75-105, 2004.

[2] Kobayashi, M. and Takeda, K., "Information Retrieval on the Web," ACM Computing Surveys (32:2), pp. 144-173, 2000.

[3] Kuechler, B. and Vaishnavi, V., "On Theory Development in Design Science Research: Anatomy of a Research Project," European Journal of Information Systems, Vol. 17, No. 5, pp. 489-504, 2008.
[4] Larsen, B. and Aone, A., "Fast and Effective Text Mining Using Linear-time Document Clustering," Proc. Fifth ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining, pp. 6-22, 1999.

[5] Mathes, A., "Folksonomies - Cooperative Classification and Communication through Shared Metadata," University of Illinois Urbana-Champaign, December 2004.

[6] Sun, A. and Lim, E.-P., "Web Unit Based Mining of Homepage Relationships," Journal of American Society for Information Science and Technology (JASIST), 2005.

[7] Latifur Khan, "Ontology-based Information Selection," Ph.D. thesis, University of Southern California, 2000.

[8] Hildebrand, M., Van Ossenbruggen, J. and Hardman, L., "facet: Browser for Heterogeneous Semantic Web Repositories," Proceedings of the 5th International Semantic Web Conference, Athens, USA, Nov 2006.

[9] Stefanidis, K., Pitoura, E. and Vassiliadis, P., "On Relaxing Contextual Preference Queries," Proc. International Conference on Mobile Data Management, 2007.

[10] Bakus, J., Hussin, M. F. and Kamel, M., "A SOM based document clustering using phrases," Proc. 9th International Conference on Neural Information Processing (ICONIP '02), Vol. 5, 2002.

[11] Brin, S. and Page, L., "The Anatomy of a Large-Scale Hypertext Web Search Engine," Proc. 7th WWW conference, 1998.

[12] Chen, H., "Web Retrieval and Mining," Decision Support Systems (35), pp. 1-5, 2003.
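
As a supplementary illustration of the ontology definition used in this paper, the University example (concept set Cuniv and relationship set Runiv) can be sketched in Java. This is a minimal sketch under stated assumptions, not part of the DISCOVERY prototype: the class name UniversityOntology, its methods, and the reading of Superclass_Of(X, Y) as "X is a kind of Y" are hypothetical choices made only for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: all class and method names here are illustrative
// assumptions and do not come from the paper or from any library.
final class UniversityOntology {

    // A relationship between two concepts, e.g. Superclass_Of(Faculty, Person).
    static final class Relationship {
        final String name;
        final String first;
        final String second;

        Relationship(String name, String first, String second) {
            this.name = name;
            this.first = first;
            this.second = second;
        }

        // In the paper's example only Superclass_Of is taxonomic.
        boolean isTaxonomic() {
            return "Superclass_Of".equals(name);
        }

        @Override
        public String toString() {
            return name + "(" + first + ", " + second + ")";
        }
    }

    final Set<String> concepts = new LinkedHashSet<>();          // the set C
    final List<Relationship> relationships = new ArrayList<>();  // the set R

    // Adds a relationship after checking that both arguments are declared concepts.
    void relate(String name, String first, String second) {
        if (!concepts.contains(first) || !concepts.contains(second)) {
            throw new IllegalArgumentException("Unknown concept in " + name);
        }
        relationships.add(new Relationship(name, first, second));
    }

    // Builds Cuniv and Runiv as listed in the text.
    static UniversityOntology build() {
        UniversityOntology o = new UniversityOntology();
        for (String c : new String[] {"Person", "Faculty", "Staff", "Student",
                                      "Department", "Project", "Course"}) {
            o.concepts.add(c);
        }
        o.relate("Department_Of", "Person", "Department");
        o.relate("Member_Of", "Person", "Project");
        o.relate("Instructor_Of", "Course", "Person");
        o.relate("Superclass_Of", "Faculty", "Person");
        o.relate("Superclass_Of", "Staff", "Person");
        o.relate("Superclass_Of", "Student", "Person");
        return o;
    }

    // Concepts X for which Superclass_Of(X, general) is in R
    // (assumed reading: X is a kind of `general`).
    List<String> specializationsOf(String general) {
        List<String> result = new ArrayList<>();
        for (Relationship r : relationships) {
            if (r.isTaxonomic() && r.second.equals(general)) {
                result.add(r.first);
            }
        }
        return result;
    }

    // All relationships in R that are not taxonomic.
    List<Relationship> nonTaxonomic() {
        List<Relationship> result = new ArrayList<>();
        for (Relationship r : relationships) {
            if (!r.isTaxonomic()) {
                result.add(r);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        UniversityOntology o = build();
        System.out.println("Concepts: " + o.concepts);
        System.out.println("Kinds of Person: " + o.specializationsOf("Person"));
        System.out.println("Non-taxonomic: " + o.nonTaxonomic());
    }
}
```

Keeping C as a set and R as a list of named pairs mirrors the definition in the text. Separating taxonomic from non-taxonomic relationships is then a simple filter, which is the distinction a classification or extraction component would need when attaching results to ontology entities.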
