Faculty of Information Technology
Radek Burget
M.S., Brno University of Technology, 2001
Abstract
Preface
1 Introduction
  1.1 Information Extraction
2 The World Wide Web Technology
  2.1 Hypertext Markup Language
  2.2 Cascading Style Sheets
  2.3 Dynamic Web Content
  2.4 Web Services
  2.5 Semantic Web
  2.6 Current Information Extraction Alternatives
3 State of the Art
  3.1 Information Extraction from HTML Documents
    3.1.1 Wrappers
    3.1.2 Document Code Modeling
    3.1.3 Wrapper Induction Approaches
    3.1.4 Alternative Wrapper Construction Approaches
    3.1.5 Computer-aided Manual Wrapper Construction
  3.2 Cascading Style Sheets from the IE Perspective
  3.3 Advanced Document Modeling
    3.3.1 Logical versus Physical Documents
    3.3.2 Logical Document Discovery
    3.3.3 Logical Structure of Documents
    3.3.4 Visual Analysis of HTML Documents
4 Motivation and Goals of the Thesis
5 Visual Information Modeling Approach to Information Extraction
  5.1 Proposed Approach Overview
  5.2 Visual Information Modeling
    5.2.1 Modeling the Page Layout
    5.2.2 Representing the Text Features
  5.3 Representing the Hypertext Links
  5.4 HTML Code Analysis for Creating the Models
    5.4.1 Tables in HTML
    5.4.2 Example of Visual Models
  5.5 Logical Structure of a Document
  5.6 Information Extraction from the Logical Model
    5.6.1 Using Tree Matching
    5.6.2 Approximate Unordered Tree Matching Algorithm
  5.7 Information Extraction from Logical Documents
6 Experimental System Implementation
  6.1 System Architecture Overview
  6.2 Using XML for Module Communication
    6.2.1 Representing the Logical Documents
    6.2.2 Logical Structure Representation
  6.3 Implementation
    6.3.1 Interface Module
    6.3.2 Logical Document Module
    6.3.3 Analysis Module
    6.3.4 Extraction Module
    6.3.5 Control Panel
  6.4 Information Extraction Output
    6.4.1 Extracted Data as an XML Document
    6.4.2 Extracted Data as an SQL Script
7 Method Evaluation
  7.1 Experiments on Physical Documents
    7.1.1 Experiment 1 - Personal Information
    7.1.2 Experiment 2 - Stock Quotes
  7.2 Independence on Physical Realization
  7.3 Information Extraction from Logical Documents
8 Conclusions
  8.1 Summary of Contributions
  8.2 Possible Improvements and Future Work
Bibliography
A Example Task Specification
B Document Type Definitions
  B.1 Task Specification
  B.2 Logical Document Representation
  B.3 Logical Structure Representation
Abstract

The World Wide Web presents the largest Internet source of information from a broad range of areas. The web documents are mostly written in the Hypertext Markup Language (HTML), which doesn't contain any means for a semantic description of the content, and thus the contained information cannot be processed directly. Current approaches to information extraction from HTML are mostly based on wrappers that identify the desired data in the document according to some previously specified properties of the HTML code. The wrappers are limited to a narrow set of documents and they are very sensitive to any changes in the document formatting.

In this thesis, we propose a novel approach to information extraction that is based on modeling the visual appearance of the document. We show that there exist some general rules for the visual presentation of the data in documents and we define formal models of the visual information contained in a document. Furthermore, we propose a way of modeling the logical structure of an HTML document based on the visual information. Finally, we propose methods for using the logical structure model for the information extraction task based on tree matching algorithms. The advantage of this approach is a certain independence of the underlying HTML code and better resistance to changes in the documents.
Preface

In 2001, when I finished my master studies at the Faculty of Electrical Engineering and Computer Science at the Brno University of Technology, I was thinking about enrolling in the PhD program. There were multiple topics available but one of them attracted me more than the others: "Methods of knowledge discovery in the WWW". Since I was fascinated by everything related to the web, I started working on this topic under the supervision of Jaroslav Zendulka. Very quickly I realized how broad this area is and I decided to focus on processing the web content, which is always the first step of the data mining process.

During the first year, I was trying to orient myself in the topic. In addition to reading great amounts of available papers, I attended the EDBT 2002 Summer School on distributed databases on the Internet. In 2002, I also spent three months on a study stay at the University of Valladolid in Spain, which has also been very valuable for me.

This thesis is the result of the last three years' work. In my first papers concerning this topic [8, 9], I formulated the features of current information extraction approaches that in my opinion caused the major problems of the existing methods, and I proposed a more abstract view of HTML documents. During further work, I proposed some suitable models of the documents and the methods of using them for information extraction [10]. The last year I spent on a formal specification of the proposed models and methods [11].
Organization of the Thesis

This thesis is organized as follows. Chapter 1 contains a short introduction to the area of processing the data accessible through the World Wide Web and explains basic concepts of information extraction from web documents.

Chapter 2 gives a brief overview of the current World Wide Web technology. Basic concepts of the most important languages are explained and the main technological aspects and forthcoming technologies are discussed.

In Chapter 3, the state of the art in information extraction from HTML documents and related areas is summarized.

In Chapter 4, we summarize major problems of the current information extraction approaches as the motivation for this thesis and we formulate the goals of the thesis.

Chapter 5 is the theoretical core of the thesis: it contains the formal models of the visual appearance and the logical structure of HTML documents and introduces a novel information extraction method based on these models.

In Chapter 6, we describe an experimental system that implements the proposed information extraction method and that has been used for testing the method on real data.

Chapter 7 summarizes the experimental results.

Finally, in Chapter 8, we conclude the thesis. We discuss possible uses of the information extraction and of the logical document structure modeling. We summarize the major contributions of the thesis and we propose possible improvements of the method and directions of further investigation in this area.

The appendices contain the XML specifications of the information extraction tasks that have been used for testing the method and formal definitions of the XML formats used within the experimental system.
Acknowledgements

I would like to express my sincere thanks to the people who have helped me write this thesis: Jaroslav Zendulka, my supervisor, for his valuable comments and organizational support, and Alexander Meduna for his great help with the formal specification issues. Many thanks also go to the people who supported me while writing this thesis: my parents, my brother and my girlfriend Jana.
Chapter 1

Introduction
The World Wide Web currently presents the largest Internet source of information from a broad range of areas. One of the main reasons for this great expansion is the absence of any strict rules for the information presentation and the relative simplicity of the used technology. The orientation to documents and the HTML language, which is mainly used for creating the documents, gives the authors enough freedom for presenting any kind of data with minimal effort, and new technologies such as Cascading Style Sheets (CSS) allow achieving the desired quality of presentation. Together with the hypertext nature of the documents, these properties make the World Wide Web a distributed and dynamic source of information.

On the other hand, the loose form of the data presentation also brings some drawbacks. With the increasing number of available documents, a problem arises of how to efficiently access and utilize all the data they contain. Due to the above properties of the web, related information is often presented at different web sites and in diverse forms. Accessing this data by browsing the documents manually is a time-consuming and complicated task. For this reason, it is desirable to process the documents automatically by a computer. As a first step, there exists an effort to provide the users with centralized views of related data from various sources in the World Wide Web, such as the services for comparing the prices of goods in on-line shops. The next step is presented by the data mining techniques that have been developed in the database branch.

For all these tasks, it is necessary to access the data contained in the documents. This is, however, not a trivial problem. As mentioned above, the state-of-the-art web consists mainly of documents written in the Hypertext Markup Language (HTML) [54]. This language is suitable for the definition of the presentation aspect of the documents but it lacks any means for the definition of the content semantics. The information contained in the documents can therefore hardly be interpreted and processed by a computer.
A possible answer to this problem is the proposal of the semantic web [6], which is based on a different technology and has a quite different nature. While the classical web can be viewed as a distributed document repository, the semantic web has many characteristics of an object database [34]. Although this technology is very promising and it is being developed rapidly, it doesn't solve the processing of the great amount of documents that are already available in the "legacy" web. Moreover, simultaneously with the semantic web, the legacy web is still growing and developing. In contrast to the semantic web, however, the evolution of its technologies takes a quite different direction. Currently, the most important issues in the development of the classical web technologies are the flexibility of the web design and effective management of the web content. For the information providers, it is not the aim to allow better automatic processing of the documents on the web, and in some cases it is even undesirable. All these facts, together with the great amount of information that is virtually available, make automatic HTML document processing an interesting and challenging area of investigation.

[Figure 1.1: An Information Extraction Task]
1.1 Information Extraction

Our aim in HTML document processing is to identify particular information that is explicitly contained in the text of the document and to store it in a structured form, e.g. in a database table or an XML document. This process is usually called information extraction from HTML documents [21, 26, 38].

As an example, let's imagine a set of pages containing information about various countries as shown in Figure 1.1. The task of information extraction in this case is to identify the values of country name, area, population etc. and to store them in a structured way. The result of processing such a set of documents could be a database table containing the appropriate values for each country.

There are two basic approaches to information extraction. The first and most common one assumes that it is specified in advance what data are to be extracted from
the documents. The specification for the country information task mentioned above can have, for example, the following format:

COUNTRY ::= <name; area; population; capital>

i.e. for each country we want to extract its name, geographical area, population and the name of the capital. This type of task is often called the slot filling task. On the contrary, the other approach is to analyze the documents and extract all the available data. This approach requires a set of documents to be compared in order to distinguish the relevant data from the remaining text [18]. It is assumed that the relevant data differ among the documents whereas the other text remains static.
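The comparison-based approach can be illustrated with a toy example: when two documents of the same class are aligned, the tokens that differ are candidates for relevant data, while the shared tokens form the static template. The sketch below is deliberately naive (simple whitespace tokenization and positional pairing; real systems align documents far more robustly):

```python
# Sketch: distinguish variable data from the static template by comparing
# two documents of the same class token by token.
def variable_slots(doc_a, doc_b):
    tokens_a, tokens_b = doc_a.split(), doc_b.split()
    # tokens that differ between the documents are candidate data slots
    return [(a, b) for a, b in zip(tokens_a, tokens_b) if a != b]

doc1 = "Country: France Capital: Paris Population: 60M"
doc2 = "Country: Japan Capital: Tokyo Population: 127M"
print(variable_slots(doc1, doc2))
# -> [('France', 'Japan'), ('Paris', 'Tokyo'), ('60M', '127M')]
# "Country:", "Capital:" and "Population:" stay constant: the template.
```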
The process of extracting information from the World Wide Web can be split into the following phases:

1. Localization of relevant documents

2. Identification of the data in the documents

3. Data storage
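The three phases can be viewed as a simple pipeline. The following schematic sketch is purely illustrative: the function names are our own, the URIs are placeholders, and a canned lookup stands in for real downloading and rule application:

```python
# Sketch: the three phases of web information extraction as a pipeline.
def locate_documents():
    """Phase 1 (information retrieval): return URIs of relevant documents."""
    return ["http://example.org/fr.html", "http://example.org/jp.html"]

def identify_data(uri):
    """Phase 2: apply extraction rules to one document.
    A canned lookup replaces real fetching and parsing in this sketch."""
    canned = {
        "http://example.org/fr.html": {"name": "France", "capital": "Paris"},
        "http://example.org/jp.html": {"name": "Japan", "capital": "Tokyo"},
    }
    return canned[uri]

def store(records):
    """Phase 3 (data storage): here a plain list acts as a database table."""
    return list(records)

table = store(identify_data(uri) for uri in locate_documents())
print(table)
```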
The first one is the information retrieval phase where the documents containing the appropriate data must be localized in the World Wide Web. Since each document is identified by its URI, the result of this phase is a list of URIs of the documents to be processed. In the World Wide Web, there are several types of services available for locating the documents, from the Web directories that provide categorized lists of URLs to the automatic search engines such as Google 1 that are based on indexing terms in the documents. The hypertext nature of HTML documents brings another problem: the presented data can be split into several HTML documents that contain links to each other. Such a set of HTML documents then forms a logical entity that is usually called a logical document [60]. Since the way in which the information is split into individual documents cannot be predicted, all the physical documents must be processed for extracting the information. As a result, the URIs of all the documents must be discovered.
The identification of the data in documents is the key phase of the information extraction process. During this phase, a model of the logical document is created and the desired data values are identified by certain extraction rules. The character of these rules depends on the model of the document and the information extraction method. Alternatively, the data values may be identified based on a statistical model of the document (e.g. the Hidden Markov Models) instead of explicit extraction rules.

The last phase of the process is the data storage. This phase depends completely on the application.
Although the information extraction area had been established long before the World Wide Web came into being, and its application to the web is almost as old as the web itself, due to the enormous extent and variability of the web it still presents a challenging problem that hasn't been satisfactorily resolved yet. In this thesis, we focus on the parts of the information extraction process that are not sufficiently explored yet. In the first phase, it is the problem of logical document discovery. We discuss the existing approaches and we propose certain improvements of current techniques. The main focus of our work is on the data identification phase. We analyze the problems of current methods and we propose a novel approach to this problem based on modeling the visual aspect of the documents and their logical structure.

1 http://www.google.com
Chapter 2

The World Wide Web Technology
</tagname>. For example, a portion of text can be written in bold by the following code:

Normal text <b>bold text</b> normal text.

that will be printed out as:

Normal text bold text normal text.

Some of the tags are unpaired; for example, a line break can be inserted using just a <br> tag. Some of the tags allow or require specifying additional attributes that define the meaning of the tag more precisely. For example, when inserting an image into the document, an attribute src must be specified that identifies the file where the image is stored: <img src="image.jpg">.
The SGML language that has been used as a basis for creating HTML is very powerful and flexible. However, this great flexibility brings certain drawbacks such as, for example, a complicated implementation. For this reason, a simplified derivation of SGML has been created that is called XML (eXtensible Markup Language). The most important simplification is that only paired tags are allowed and the tags must be properly nested (they may not overlap). For hypertext publishing, an XML variant of HTML has been created that is called XHTML (eXtensible HyperText Markup Language). In comparison to HTML, many tags have been omitted and replaced by other means (mainly the Cascading Style Sheets that are described below). The unpaired tags have been replaced by paired tags even in the cases where the paired tag makes no sense; e.g. the above-mentioned line break must be written as <br></br>, which according to the XML specification can also be written as <br/>. Currently, XHTML is an upcoming standard in web publishing. All the mentioned web standards are maintained by the World Wide Web Consortium (http://www.w3.org).

In this thesis, we focus on hypertext documents written in the HTML language, optionally in conjunction with the Cascading Style Sheets. However, all the techniques and methods proposed below apply to the XHTML language as well.
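The well-formedness requirement of XML (paired tags, proper nesting) can be checked mechanically with a stack. A minimal sketch in Python; the set of unpaired HTML tags and the simple regular expression are our own simplifications (the regex, for instance, does not handle ">" inside quoted attribute values):

```python
# Sketch: check that tags in a markup fragment are properly nested,
# i.e. every opening tag has a matching closing tag and tags do not overlap.
import re

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # unpaired HTML tags

def is_well_formed(markup: str) -> bool:
    stack = []
    for m in re.finditer(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>", markup):
        closing, name, self_closing = m.group(1), m.group(2).lower(), m.group(3)
        if self_closing or name in VOID_TAGS:
            continue                      # <br/> or a known unpaired tag
        if closing:
            if not stack or stack.pop() != name:
                return False              # closing tag without a matching opening
        else:
            stack.append(name)
    return not stack                      # every opened tag was closed

print(is_well_formed("<p><b>bold</b></p>"))   # properly nested
print(is_well_formed("<p><b>bad</p></b>"))    # overlapping tags
```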
2.2 Cascading Style Sheets

Cascading Style Sheets (CSS) is a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents. It allows separating the visual style of the page from the HTML code, which facilitates the creation and management of the web pages and allows better flexibility. For XHTML, the Cascading Style Sheets present a unique mechanism for specifying the visual style of the page.

The style sheets consist of a set of rules that define the style for particular (X)HTML tags. For example, a rule can look as follows:

h1 { font-weight: bold; color: blue; }

This rule says that the headings marked with the <h1> tag will be printed in bold and blue color. It consists of a selector that determines which tags this rule should be applied to, and a declaration. The declaration consists of a name of the CSS property (in our case font-weight: and color:) and its value (bold and blue). The values of the individual properties are inherited as the tags are nested in the HTML code; the values of properties that are not specified in our example are inherited, for example, from the declaration of the <body> tag that encloses the whole content of an HTML document.

Each element in the HTML code denoted by some tag can be assigned a named class using the class attribute (e.g. <p class="heading">) and/or assigned an identifier using the id attribute that is unique for the whole document. The CSS rule selectors can be based not only on the tag names (as we can see in the example) but also on the element classes and identifiers.
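The inheritance of property values along the nesting of tags can be illustrated with a small sketch. The tag-to-declaration mapping below is invented for illustration, and the model is intentionally simplified: real CSS distinguishes inherited from non-inherited properties and adds selector specificity on top of this:

```python
# Sketch: resolve the effective style of a nested element by walking from
# the root to the element and letting each declaration override inherited values.
def effective_style(path, rules):
    """path: tag names from the root to the element, e.g. ["body", "h1"].
    rules: dict mapping a tag name to its declaration (property -> value)."""
    style = {}
    for tag in path:
        style.update(rules.get(tag, {}))  # declarations nearer the element win
    return style

rules = {
    "body": {"color": "black", "font-family": "serif"},
    "h1": {"font-weight": "bold", "color": "blue"},
}
# An <h1> inside <body>: it inherits font-family and overrides color.
print(effective_style(["body", "h1"], rules))
```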
The style sheets are incorporated into the HTML code in one of the following ways:

- The style sheet is placed in a separate file that is available through the web and it is referenced in the HTML document header.

- The style sheet is directly inserted into the header of the HTML document.

- The declaration is specified without a selector, directly for a particular tag in the document as a value of its style attribute. This is called an inline style definition. For example, the tag <p style="font-weight: bold;"> starts a paragraph written in bold.

The use of CSS has grown significantly in the last few years, which has been enabled by sufficient support of CSS in the modern web browsers. It is now practically unimaginable to create a modern, well-designed web presentation without the use of CSS.
2.3 Dynamic Web Content

The modern web technologies allow creating documents that don't only contain a static text but whose contents can be generated dynamically upon a client request according to various factors. Basically, the dynamic content can be generated in two ways:

- On the server side. There are various technologies that allow generating the HTML code at the time of the client request (e.g. CGI, PHP, ASP, JSP, etc.). In this case, no support for dynamic content is required on the client side; the client always obtains a complete HTML document that can, however, be different on each request.

- In the client browser. The HTML document that is sent to the client contains routines written in a scripting language (in most cases, the JavaScript language is used) that are interpreted by the client browser. The scripting language must be supported by the browser.
From the document processing point of view, the former case doesn't present any particular problem because the client always obtains a complete document that can be processed directly. The use of JavaScript or a similar technology presents a serious problem not only for automatic web content processing tools but also for various alternative browsers such as text-oriented browsers or voice readers for blind people. For this reason, JavaScript should be used as an additional feature only and its support shouldn't be automatically expected by the authors of documents. For the same reason, we don't address the problems related to dynamic web content generated on the client side in this thesis.
An additional problem of the pages dynamically generated on the server side is that the generation procedure requires certain data to be supplied by the client when requesting the page from the server. Typically, this is the case of pages that are generated according to the values filled in to an interactive form that is available on another page. Without filling in some fields of the form, i.e. without providing particular data along with the page request, the dynamic page is not accessible. The great amount of such pages available through the web is often called the hidden web because its content is not easily accessible to any automatic web content processing tools, including the search engine indexing robots. Although the hidden web presents a considerable problem for any automatic processing of the web content, it forms a separate and quite large area of investigation that is quite distant from information extraction as such. Therefore, in this thesis, we only mention this problem without discussing it in greater detail.
2.4 Web Services

Web services present a way of using the Internet for application-to-application communication. They provide a standardized way of specifying the capabilities and programmatic interfaces of the services that are available over the Internet, and the communication protocols that allow using the services by other services or applications. These possibilities give rise to a distributed service architecture where the services can use each other for performing a particular task, and the inter-service communication and the results have a standardized, computer-processable format.

For example, the Google search engine provides a standard web interface where the user uses a web browser for downloading an HTML page with a query form; he fills in the search query and posts the query to the server. The server returns the list of search results, again as an HTML document. The results in this form are intended to be displayed only; it would be very complicated to further process the query results by some application. For this reason, Google provides an alternative interface in the form of a web service. This service receives the query and returns the appropriate results using the standardized web service protocols based on XML so that the results may be further processed.
The roles in web services and the relations among the protocols and the roles are shown in Figure 2.1. The basic roles are the service provider and the service requester (client). The requester sends a request and the provider sends back the results, both using the Simple Object Access Protocol (SOAP). In order to allow the automatic localization of appropriate web services over the Internet, the web service proposal includes the role of a service broker that maintains a registry of available web services. Each service publishes information about its purpose and interface to the service broker using the Web Services Description Language (WSDL). The service registry format and the way of querying it are described by the UDDI (Universal Description, Discovery, and Integration) standard.

[Figure 2.1: The relations among protocols and roles in web services. The service provider publishes to the service broker (WSDL), the service requester finds services through the broker (UDDI), and the provider and requester bind to each other (SOAP).]
is and to state that extracting information from the unsophisticated HTML documents is currently the only way to access the information available on the web.
Chapter 3

State of the Art
[Figure 3.1: An example of a simple document. The rendered page reads "Capital Cities", "France - Paris", "Japan - Tokyo"; the corresponding HTML code is:
<h1>Capital Cities</h1>
<b>France</b> - <i>Paris</i>
<b>Japan</b> - <i>Tokyo</i>
...]
[Figure 3.2: A wrapper: HTML documents and extraction rules are the inputs of the wrapper, which produces the extracted data.]

[Figure 3.3: A modified version of the document from Figure 3.1, where the caption is marked with <i> instead of <h1>:
<i>Capital Cities</i>
<b>France</b> - <i>Paris</i>
<b>Japan</b> - <i>Tokyo</i>
...]
3.1.1 Wrappers

According to Kushmerick [38], a wrapper is a procedure that provides the extraction of particular data from the HTML document, as illustrated in Figure 3.2. For the identification of the particular data in the document, the wrapper uses either a set of extraction rules that define the way of identification of each individual data field, or a model of the document that is used for deciding which part of the document corresponds to the particular data value (for example, the Hidden Markov Models are used in some methods). By wrapper construction, we mean the process of formulating the extraction rules or the model of the document for a particular information extraction task.
Kushmerick in [38] defines six classes of wrappers with increasing expressiveness that differ in the way the extraction rules are defined. The simplest wrapper class is called LR (left-right). In this class, one extraction rule is defined for each data field to be extracted. Each rule is a pair of strings that delimit the field in the document code from the left and from the right. Let's consider again the example of a simple document shown in Figure 3.1. An LR wrapper for this task can be defined as a set of two rules:

W = { [<b>; </b>], [<i>; </i>] }

where the first rule identifies the values of country and the second one identifies the values of capital.
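To illustrate how such left-right delimiter rules operate, here is a minimal executable sketch. The function and its looping strategy are our own; Kushmerick defines the wrapper classes declaratively, and this merely reproduces the delimiter semantics:

```python
# Sketch: an LR wrapper. Each rule is a (left, right) delimiter pair;
# extraction repeatedly finds the text enclosed between the delimiters.
def lr_extract(document, rules):
    records, pos = [], 0
    while True:
        record = []
        for left, right in rules:
            start = document.find(left, pos)
            if start < 0:
                return records            # no more occurrences of the delimiter
            start += len(left)
            end = document.find(right, start)
            if end < 0:
                return records
            record.append(document[start:end])
            pos = end + len(right)
        records.append(tuple(record))

doc = ("<h1>Capital Cities</h1>"
       "<b>France</b> - <i>Paris</i>"
       "<b>Japan</b> - <i>Tokyo</i>")
W = [("<b>", "</b>"), ("<i>", "</i>")]
print(lr_extract(doc, W))   # [('France', 'Paris'), ('Japan', 'Tokyo')]
```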
This way of defining the rules is apparently not usable for all tasks; let's consider a slightly modified example in Figure 3.3. If the above LR wrapper were invoked on this document, the caption would be incorrectly identified as a city name. As a solution, more complex wrapper classes have been defined:

- HLRT (head-left-right-tail). Two additional delimiters are used; the head delimiter is used to skip potentially confusing text in the document heading and the tail is used to skip potentially confusing text at the bottom of the document.

- OCLR (open-close-left-right). The open and close delimiters identify each record in the document.

- HOCLRT (head-open-close-left-right-tail). An analogical combination of the above two classes.

- N-LR and N-HLRT. The modifications of the LR and HLRT classes for handling nested tabular data in documents.
According to Kushmerick, the six wrapper classes were able to handle about 70% of the real web documents. It is obvious that a wrapper works properly only for a limited set of documents that correspond to the previously defined extraction rules. In the literature, such a set of documents is commonly called a document class. In most cases, the document class consists of documents of the same topic generated automatically from a back-end database by an identical procedure, or at least created by the same author. Moreover, the wrapper works only until the data presentation changes. As follows from a simple comparison of the documents in Figures 3.1 and 3.3, even a minor change in the document design can cause the wrapper to stop working properly. Due to the distributed and dynamic nature of the Web, this state cannot be predicted, and since no additional information about the extracted data is provided, it is not trivial to detect the malfunction of the wrapper automatically.
When using wrappers for integrating information from many sources, a wrapper or wrappers must be created for each source, and when some conditions change, the wrappers must be modified appropriately. From this point of view, the method by which the wrappers are constructed is important. The most obvious method is writing the wrappers by hand, i.e. by analyzing a set of documents to be processed and determining the delimiting strings. This method is very time-consuming and error-prone; unfortunately, it is currently the most used method. Companies employ people who work on coding new wrappers and maintaining the old ones. Since this approach presents a serious scalability problem, many approaches have been developed for an automatic inference of wrappers.
Most methods for automatic wrapper construction are based on wrapper induction. This approach is based on machine learning algorithms and the wrapper construction proceeds in the following phases:

1. A supervisor provides a set of training samples (i.e. labeled HTML pages)

2. A machine learning algorithm is used to learn the extraction rules

3. A wrapper is generated based on the extraction rules

4. The wrapper is used on the target documents
15
html
Country Capital
Fran
e Paris
title h1 table
Japan Tokyo tr tr tr
td td td td td td
3.1.2 Document Code Modeling

The most straightforward model is to represent the document code simply as a string of characters. In this representation, the text of the document is not explicitly distinguished from the embedded tags. When processing the documents represented this way, usually extraction rules based on delimiting substrings [57] or regular expressions [3] are used. For example, the words that end with a colon can introduce an important data value. Such phrases can be found using the regular expression [A-Za-z0-9 ]+[:].
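The regular expression above can be tried directly; the sample text is ours. Note that the greedy character class also absorbs the words between two matches, which is exactly the kind of imprecision that string-level models have to cope with:

```python
# Sketch: find phrases ending with a colon that may introduce data values.
import re

text = "Name: John Smith Phone: 555 1234"
print(re.findall(r"[A-Za-z0-9 ]+[:]", text))
# -> ['Name:', ' John Smith Phone:']
# The second match greedily includes the preceding words, because spaces
# are part of the character class.
```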
For applying machine learning algorithms, it appears more suitable to use a different representation where the basic unit is a word instead of a character. The document is represented as a sequence of words [15, 26, 35, 49]. Each word can be assigned various attributes based on some of its orthographical or lexical properties. The embedded HTML tags can either be omitted or used for inferring additional attributes of the individual words [26, 49] (e.g. that the text is in a caption). A special case of such a model is used by [35]. In this model, on the contrary, the HTML tags are regarded as the symbols of an alphabet; any text string between each pair of subsequent tags is represented with a reserved symbol x.
Most common is a hierarchical model of the HTML code that represents the nesting of the tags in the document [12, 17, 21, 22, 23, 37]. Figure 3.4 shows an example of a simple document and the corresponding tree of the HTML tags. In order to make it possible to create such a model for an HTML document, it is necessary to pre-process the document so that we obtain a so-called well-formed document [12], where every opening tag has a corresponding closing tag and the tags are properly nested. XHTML documents are always well-formed. The text content of the document is then contained in the leaf nodes of the tree, or it is not included in the model at all. The advantage of the hierarchical model is that it describes the relations among the tags in addition to the observed properties of individual words and tags.
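A minimal sketch of such a hierarchical model can be built with Python's standard HTML parser; the fragment below assumes well-formed input (as the pre-processing step above guarantees) and keeps text only in the nodes that directly contain it:

```python
from html.parser import HTMLParser

# Build a tree of tag nodes from well-formed HTML; text content ends up
# attached to the innermost (leaf) nodes, as in the hierarchical model.
class Node:
    def __init__(self, tag):
        self.tag = tag
        self.children = []
        self.text = ""

class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]
    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)   # nest under the open tag
        self.stack.append(node)
    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()
    def handle_data(self, data):
        if data.strip():
            self.stack[-1].text += data.strip()

def build_tree(html):
    b = TreeBuilder()
    b.feed(html)
    return b.root

tree = build_tree("<html><h1>Capitals</h1><table><tr><td>France</td>"
                  "<td>Paris</td></tr></table></html>")
print([child.tag for child in tree.children[0].children])  # → ['h1', 'table']
```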
3.1.3 Wrapper Induction Approaches
Current methods of wrapper induction are based on knowledge from various areas of research. Many approaches are based on grammatical or automata inference [18, 25, 35, 37, 38]; other approaches use relational machine learning algorithms [17, 21, 26, 57]. A quite different approach is presented by the methods based on conceptual modeling [22, 23]. Note that this is a coarse classification only and the different approaches influence each other.
As mentioned in section 3.1.1, the wrapper induction approaches require a set of labeled examples of the documents that are used for inferring the extraction rules. According to the artificial intelligence terminology, we call this set of examples a training set and the process of inferring the extraction rules the wrapper training.
For evaluating the performance of the individual information extraction approaches, there are two commonly used metrics: the precision P and the recall R [37]. They are defined as follows:

P = c / i    (3.1)

R = c / n    (3.2)

where c is the number of correctly extracted records, i is the total number of extracted records, and n is the total number of records contained in the documents.

Grammatical Inference

The grammatical inference task can be stated as follows: given a set S+ of sentences over an alphabet Σ that belong to a language L, and a (potentially empty) set S- of sentences over Σ that do not belong to L, we want to infer a grammar that generates L.
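The metrics (3.1) and (3.2) can be computed directly from the set of extracted records and the set of records actually present; the sketch below assumes that records are comparable strings:

```python
# Precision P = c / i and recall R = c / n, where c is the number of
# correctly extracted records, i the total number of extracted records,
# and n the total number of records present in the documents.
def precision_recall(extracted, relevant):
    correct = len(set(extracted) & set(relevant))
    p = correct / len(extracted) if extracted else 0.0
    r = correct / len(relevant) if relevant else 0.0
    return p, r

extracted = ["Paris", "Tokyo", "Lyon"]             # records the wrapper returned
relevant = ["Paris", "Tokyo", "Berlin", "Prague"]  # records actually present
print(precision_recall(extracted, relevant))
```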
The basic idea behind using grammatical inference for information extraction is that generating a wrapper for a set of HTML documents corresponds to the problem of inferring a grammar for the HTML code of the pages and finally using the inferred grammar for the extraction of the data fields. This idea is, however, not directly applicable to document processing. The main obstacle is that only positive examples are available on the web, i.e. the available documents. As follows from Gold's work [30], neither regular nor context-free grammars can be correctly identified from positive samples only. This problem can basically be solved by limiting the language class to a subclass of regular languages that is identifiable in the limit (e.g. k-reversible languages) or by changing the computational model (artificial negative samples or supplying additional information).
One of the approaches to information extraction is presented in [25]. For locating a particular data field in the document, a combination of a Bayes classifier and grammatical inference is used. The document is modeled as a sequence of words. The Bayes classifier processes parts of the document determined by a floating window of a fixed length of n words. To each position of the window, a probability is assigned that the particular part of the text matches the particular data field (e.g. the name of a person). The problem of this method is in determining the exact boundaries of the data field (the window has a fixed size). Moreover, the classification does not consider the word order; the Bayes classifier only works with the occurrence of individual words. These problems are solved by the grammatical inference. The words contained in the document are converted to abstract symbols from an alphabet using their orthographical properties. Thus the alphabet contains symbols of the type word-lower+dr (the abbreviation "Dr." written arbitrarily in upper-case or lower-case letters), capitalized-pct (any word beginning with a capital letter), etc. The training set is used both for training the Bayes classifier and for inferring a finite automaton, where each terminating state is assigned the probability that the accepted string corresponds to the particular data field. When a string is not accepted at all, it is assigned a small probability. The result is then the product of the probabilities from the Bayes classifier and from the automaton.
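The fixed-length floating-window classification can be sketched as follows; this is a simplified illustration, not the actual system from [25], and the training windows, vocabulary and smoothing are assumptions made for the example:

```python
import math
from collections import Counter

# Score every window of n words with a naive Bayes model that ignores word
# order and uses only word occurrences, as described in the text.
def train_word_probs(positive_windows, vocab):
    counts = Counter(w for win in positive_windows for w in win)
    total = sum(counts.values())
    # Laplace smoothing so unseen words get a small non-zero probability
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}

def window_score(window, word_probs):
    return sum(math.log(word_probs[w]) for w in window)

def best_window(text, n, word_probs):
    words = text.split()
    windows = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return max(windows, key=lambda win: window_score(win, word_probs))

training = [("dr", "john", "smith"), ("dr", "jane", "doe")]  # labeled name fields
vocab = {"dr", "john", "smith", "jane", "doe", "visited", "prague", "the"}
probs = train_word_probs(training, vocab)
print(best_window("the dr john smith visited prague", 3, probs))
```

Note how the sketch exhibits exactly the weakness mentioned above: the window length is fixed, so a two-word or four-word name cannot be delimited correctly.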
Another approach to the application of grammatical inference is presented by Kosala [37]. This method is based on tree languages. The document is modeled according to its HTML code as a tree where the nodes contain the name of the HTML tag or a text string. Instead of a set of strings over Σ, we obtain a set of trees over a new alphabet V, where each tree corresponds to a particular document from the training set. In the sample trees, the data field to be extracted is replaced by a special symbol x. Then the aim is to infer a deterministic tree automaton that accepts the trees in which the desired data value is replaced by x. When using this automaton for information extraction, we subsequently replace the nodes of the document tree by x. Once the resulting tree is accepted by the automaton, the extraction result is the original string that has been replaced by x.
A different way of applying grammatical inference is presented by [35]. This work is based on stochastic context-free grammars. The input alphabet is formed by the HTML tags and an extra symbol text that represents any non-empty text string between a pair of tags. During the grammatical inference process, the complexity of the grammar is evaluated and the simplest grammar is chosen. The non-terminals of the inferred grammar correspond to basic parts of the document. For more exact localization of the data fields, regular expressions are used that represent the domain-specific knowledge.
A completely new view of the problem is presented by [18]. The presented approach deals with the problem of schema discovery: given a set of HTML documents, we are looking for a common schema of their content and extraction rules based on the discovered schema. The schema discovery is based on comparing the documents; the parts that are present in all the documents are considered static content, whereas the changing parts correspond to the data values.
Hidden Markov Models
A Hidden Markov Model (HMM) is a finite state automaton with stochastic state transitions and symbol emissions. The automaton models a probabilistic generative process whereby a sequence of symbols is produced by starting at a designated start state, transitioning to a new state, emitting a symbol, and so on until a designated final state is reached.
The application of HMMs to information extraction is based on the hypothetical assumption that the text of a document has been produced by a stochastic process, and we attempt to find a Markov model of this process. The states of the model are associated with the tokens to be extracted. The model transition and emission probabilities are learned from training data. The information extraction is performed by determining the sequence of states that was most likely to have generated the entire document and extracting the symbols that were associated with designated target states.
This approach is used for example by [27]. For each field to be extracted, a separate HMM is used that consists of two types of states: target states that produce the tokens to be extracted, and background states. Each of the HMMs models the entire document, so that no pre-processing is needed, and the entire text of the documents from the training set is used to train the transition and emission probabilities.
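The target/background decoding can be sketched with a tiny two-state model and Viterbi decoding; all probabilities below are hypothetical values chosen for the example and are not taken from [27]:

```python
import math

# Two-state HMM: tokens aligned with the "target" state by the most likely
# (Viterbi) state sequence are the extraction result.
STATES = ["background", "target"]
START = {"background": 0.9, "target": 0.1}
TRANS = {"background": {"background": 0.8, "target": 0.2},
         "target": {"background": 0.4, "target": 0.6}}
# Emission probabilities over a tiny vocabulary (assumed for illustration)
EMIT = {"background": {"name": 0.4, "is": 0.4, "john": 0.1, "smith": 0.1},
        "target": {"name": 0.05, "is": 0.05, "john": 0.45, "smith": 0.45}}

def viterbi(tokens):
    # delta[s] = best log-probability of any state path ending in state s
    delta = {s: math.log(START[s]) + math.log(EMIT[s][tokens[0]]) for s in STATES}
    back = []
    for tok in tokens[1:]:
        prev = delta
        delta, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            delta[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][tok])
            ptr[s] = best
        back.append(ptr)
    state = max(STATES, key=lambda s: delta[s])
    path = [state]
    for ptr in reversed(back):      # follow back-pointers to recover the path
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

tokens = ["name", "is", "john", "smith"]
path = viterbi(tokens)
extracted = [t for t, s in zip(tokens, path) if s == "target"]
print(extracted)  # → ['john', 'smith']
```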
Relational Learning Approaches
The general principle of these techniques is similar to the above. Again, we assume that there exists a class of documents with similar properties and that a training set of documents from this class is available. In the training set, we describe some properties of each data field to be extracted using logical predicates. Then we use relational learning algorithms for inducing general rules that identify the data fields in documents.
Freitag [26] assigns each word in the documents certain attributes based on the properties of the given portion of the text, such as word length, character type (letters, digits) or orthography, and adds some additional attributes that describe the relation between the word and the surrounding HTML tags (e.g. the word forms part of a heading or the word forms a table row). Each data field to be extracted is then described by logical predicates based on these attributes, and using the SRV algorithm (based on the FOIL algorithm [53]) a general rule is inferred that identifies the data field in the document. DiPasquo [21] extends this approach by modeling the hierarchical structure of HTML tags in the document, which allows describing the relations among the HTML tags more exactly.
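The kinds of per-word attributes described above might be computed as in the following sketch; the concrete attribute names are illustrative choices, not the SRV feature set itself:

```python
# Assign each word attributes of the kinds mentioned in the text: length,
# character type, orthography, and a relation to the enclosing HTML tag.
def word_attributes(word, enclosing_tag=None):
    return {
        "length": len(word),
        "all_digits": word.isdigit(),
        "all_letters": word.isalpha(),
        "capitalized": word[:1].isupper(),
        "in_heading": enclosing_tag in ("h1", "h2", "h3"),
        "in_table_row": enclosing_tag in ("tr", "td", "th"),
    }

attrs = word_attributes("Paris", enclosing_tag="td")
print(attrs["capitalized"], attrs["in_table_row"])  # → True True
```

A relational learner would then induce rules over such predicates, e.g. "a capital-city field is a capitalized word inside a table row".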
Soderland [57] uses an approach based on methods for natural language processing. Each word of the document is assigned a semantic class (e.g. Time, Day, Weather condition) using predefined concept definitions. Using the learning algorithms AQ and CN2 [16], a general description of each data field is inferred.
3.1.4 Alternative Wrapper Construction Approaches
Wrapper induction is not the only method for automatic wrapper construction. The following techniques are based on a direct analysis of the HTML code of the particular documents to be processed. The aim of these techniques is to avoid the training phase and eliminate the requirement of a training set of documents. On the other hand, these techniques are usually based on empirical heuristics, and it is often hard to specify exactly for which documents the method is suitable. Furthermore, some predefined domain-specific knowledge is often required, and in some cases (e.g. [33]) the method is language-dependent.
HTML-related Heuristics
These techniques are based on specific empirical heuristics related to the HTML language generally or to some generally accepted ways of its usage.
Ashish [3] locates certain important words that introduce important information in the document (e.g. Geography, Transportation, etc.), so-called tokens. The tokens are identified based on the properties of the text and the surrounding HTML tags, and all the possible occurrences are firmly defined by regular expressions: for example, the text between the <b> and </b> tags, words in headings, text that ends with a colon, etc. Each token indicates the start of a section of the document. Next, the hierarchical structure of sections is built by comparing the font size and the indentation of the text that begins each section. The proposed extraction tool contains a graphical user interface for an interactive adjustment of the tokens and the hierarchical structure. Finally, the wrapper is generated using the YACC generator.
Another approach is used by [12, 22]. This approach assumes that a unified separator of the data records can be found in the document. The document is modeled as a tree of tags, and based on various heuristics, a general structure is discovered that is used as a record separator. As the next step, a data field separator is located in a similar way. The heuristics are based on the statistical analysis of the text in potential sections, repeating patterns, etc. Furthermore, predefined knowledge about the meaning of some HTML tags is used. A similar approach is used in [43]. The proposed MDR algorithm attempts to locate the regions of the document tag tree that potentially contain data records. In these regions, one or more data records can be identified.
Conceptual Modeling
The conceptual modeling approach is more common in the area of information extraction from plain text documents; however, it can be used for HTML documents too. For example, Embley et al. [22, 23] propose a method where, as the first step, an ontological model of the extracted information is created, and based on this model, corresponding data records are discovered in the document. It is possible to combine this approach with the HTML code analysis described above. The main difference is that the structure of the information is not inferred from the document but is known in advance.
3.1.5 Computer-aided Manual Wrapper Construction
This category is formed by special tools that generate wrappers in collaboration with a human expert. These tools usually provide a graphical user interface that allows the wrapper creator to analyze the documents to be processed and to easily design a wrapper.
The DoNoSe tool [1] works mainly with plain text documents. The tool allows a hierarchical decomposition of the contained data and mapping selected regions of the text to components of the data model. LiXto [5] is a fully visual interactive system for the generation of wrappers based on a declarative language for the definition of HTML/XML wrappers. Both tools provide a graphical user interface that allows a user with no programming experience to produce the appropriate wrappers.
3.2 Cascading Style Sheets from the IE Perspective
With the new technologies being introduced to the WWW, some critical disadvantages of the wrapper approach appear. For example, following the recommendations of the WWW Consortium, the usage of Cascading Style Sheets (CSS) [7] has grown significantly in the last few years. This technology allows defining the visual layout and formatting of an HTML or XML document independently of its content. This property is particularly useful when the HTML documents are generated dynamically (e.g. from a database), since it allows modifying the visual appearance of the pages without modifying the HTML generator. On the other hand, it significantly reduces the amount of information that can be used by a wrapper for identifying the information in HTML documents.
Figure 3.5 shows an example of traditional HTML document formatting. It is obvious that all the names of the countries are denoted by the <b> and </b> tags. A wrapper for extracting countries from the document simply looks for these tags and extracts the enclosed text.
...
<h1>Capitals</h1>
<b>France</b>
- <i>Paris</i>
<b>Japan</b>
- <i>Tokyo</i>
...
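The tag-based wrapper described above can be sketched in a few lines; the regular expression over <b> tags is a straightforward rendering of the rule stated in the text:

```python
import re

# Country names are assumed to be enclosed in <b>...</b>, so the wrapper
# simply captures the text between those tags (non-greedy match).
def extract_countries(html):
    return re.findall(r"<b>(.*?)</b>", html)

html = "<h1>Capitals</h1> <b>France</b> - <i>Paris</i> <b>Japan</b> - <i>Tokyo</i>"
print(extract_countries(html))  # → ['France', 'Japan']
```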
One of the possible variants of the same document code written using CSS is shown in Figure 3.6. The above method for defining the wrapper fails in this case because all the elements are denoted by the same <span> tag. Moreover, there are several ways to incorporate CSS into the HTML code (in our example, we can see the classes defined by the class attribute and inline styles defined by the style attribute of the HTML tags).
...
<span class="heading">Capitals</span>
<span class="country">France</span>
- <span style="font-style: italic;">Paris</span>
<span class="country">Japan</span>
- <span style="font-style: italic;">Tokyo</span>
...
As we can see, all the HTML tags have been replaced with a single <span> tag that is used for specifying the CSS class of the individual parts of the text. Moreover, as mentioned in section 2.2, the style definitions can be incorporated into the HTML code in different ways. In any case, the result is that the "semantic" HTML tags such as headings, emphasis, etc. may be completely removed from the HTML code and replaced by CSS definitions. This change significantly complicates or even makes unusable most of the wrapper induction methods mentioned above.
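The failure mode can be demonstrated on the CSS variant of the example document; the class-based fallback below is an assumed repair that works only as long as the author keeps this exact class name:

```python
import re

# The CSS variant of the example document (cf. Figure 3.6).
css_html = ('<span class="heading">Capitals</span> '
            '<span class="country">France</span> - '
            '<span style="font-style: italic;">Paris</span> '
            '<span class="country">Japan</span> - '
            '<span style="font-style: italic;">Tokyo</span>')

def tag_wrapper(html):
    # The original tag-based wrapper: looks for <b>...</b>
    return re.findall(r"<b>(.*?)</b>", html)

def class_wrapper(html):
    # A repaired wrapper keyed to the (assumed) "country" class name.
    return re.findall(r'<span class="country">(.*?)</span>', html)

print(tag_wrapper(css_html))    # the original wrapper finds nothing
print(class_wrapper(css_html))  # works only while this class name is kept
```

Note that the class-based version would break just as easily: renaming the class or switching to an inline style invalidates it, which is exactly the brittleness discussed above.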
From this point of view, it is not reliable to construct wrappers that rely directly on the HTML code. HTML and CSS are only means for creating documents, and they can be used in various ways. The fixed point in this variable world is the final presentation of the document, which has usually been carefully designed by experts from various branches and which must be delivered to the reader in an unchanged form, especially regarding the visual appearance and the structure of the presented document.
For this reason, instead of modeling the HTML code directly, more sophisticated models need to be proposed that describe the documents from the perspective of their final presentation. These models should describe the organization and the visual appearance of the documents as they are expected to be perceived by a human reader.
3.3 Advanced Document Modeling
3.3.1 Logical versus Physical Documents
The World Wide Web consists mainly of HTML documents that may reference each other using hypertext links. In this thesis we will call these documents physical documents, because each of them corresponds to a physical file stored on a WWW server.
However, the hypertext links allow splitting complex information into multiple physical documents, where each of them contains a specific part of the information and the individual parts are interconnected by the hypertext links. This way of presentation is very frequent on the World Wide Web. We will call such a set of physical documents that forms a complete information entity a logical document. An example of a logical document is given in Figure 3.7.

Figure 3.7: Logical document

The arrows represent links among the physical documents. The three physical documents in the dashed box form a logical document. As we can see, any of the documents that form the logical document can contain links to external documents that do not belong to the logical document. Since it is not specified in the documents which of the links are external, the discovery of logical documents in the web is not a trivial task, and it requires further analysis of the documents and the links.
3.3.2 Logical Document Discovery
The task of logical document discovery consists of locating all the physical HTML documents that form a logical entity called a logical document. The input is the URI of the main page (sometimes also called the top page or the index page), which is intended by the author of the document to be an entry point to the logical document and is usually directly accessed by the users¹. The output is the list of the URIs of all the HTML documents that form the logical document.
The primary source of information for discovering the related physical documents is the HTML links. An HTML document that forms part of a logical document, except for the main page, must be referenced by at least one other HTML document in the same logical document. The process of logical document discovery is therefore quite straightforward:
1. We find the URIs of all the documents referenced in the main page.
2. We select all URIs that point to documents that belong to the logical document.
3. We repeat the process recursively for each selected document.
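The discovery process above amounts to a constrained crawl. In the sketch below, the link structure is a dictionary standing in for fetching and parsing real pages, and belongs_together() is a placeholder for the classification problem discussed next (here it simply restricts the crawl to one hypothetical site):

```python
from collections import deque

# Hypothetical link structure: page URI -> URIs it references.
links = {
    "http://example.org/index.html": ["http://example.org/part1.html",
                                      "http://other.org/ad.html"],
    "http://example.org/part1.html": ["http://example.org/part2.html"],
    "http://example.org/part2.html": ["http://example.org/index.html"],
    "http://other.org/ad.html": [],
}

def belongs_together(main_uri, uri):
    # Placeholder heuristic: same server as the main page.
    return uri.split("/")[2] == main_uri.split("/")[2]

def discover_logical_document(main_uri):
    found, queue = {main_uri}, deque([main_uri])
    while queue:
        uri = queue.popleft()
        for ref in links.get(uri, []):                            # step 1
            if ref not in found and belongs_together(main_uri, ref):  # step 2
                found.add(ref)
                queue.append(ref)                                 # step 3
    return sorted(found)

print(discover_logical_document("http://example.org/index.html"))
```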
The major problem that has to be solved is in the second step: how to distinguish the documents that belong to the logical document from the remaining ones? Several different types of information can be used for resolving this task:
¹ The URI of the main page is usually publicly available, in contrast to the URIs of the remaining documents.
- Document classification. We assume that the individual physical documents that form the logical document are more similar to each other than to the remaining referenced documents. The similarity of documents is usually computed using methods based on term frequency in the document, such as the tf-idf method [55].
- Document layout analysis. We analyze and compare the layout of the documents, as mentioned for example in [42].
- Link topology analysis. In general, the topology of the links among a set of HTML documents can be represented as a directed graph. By analyzing this graph using specific heuristics, we can detect a subgraph with certain properties. This technique is also used for detecting so-called communities in the web [29].

Additionally, some limitations can be placed on the format of the URIs (e.g. all documents must be placed on the same web server, etc.).
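The document classification heuristic can be sketched with a small tf-idf similarity computation; the three example "documents" are invented bags of words, and the expectation is only that pages of one logical document score more similar to each other than to an unrelated referenced page:

```python
import math
from collections import Counter

# tf-idf weighting: term frequency times the log-inverse document frequency.
def tfidf_vectors(docs):
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc.split()))
    vectors = []
    for doc in docs:
        tf = Counter(doc.split())
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v)) if norm(u) and norm(v) else 0.0

docs = ["thesis chapter information extraction wrappers",
        "thesis chapter visual information modeling",
        "buy cheap flights online today"]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # → True
```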
Usually, these types of information are used together. Tajima et al. [60] show that most logical documents are organized into hierarchies. The approach is based on the assumption that the authors include hypertext links to the documents that are intended to be used by the users as a standard way of getting to a particular document. These links form so-called standard navigation paths. There are, of course, other links that either point outside of the logical document or have a lower importance, such as links back to the main page. The proposed method of logical document discovery has two steps. First, the hierarchical structure is discovered by identifying paths intended by the authors of the documents to be the standard navigation routes from the main page to other pages. Then, the discovered hierarchy is divided into sub-hierarchies corresponding to the logical documents based on comparing the document similarity using the tf-idf method.
3.3.3 Logical Structure of Documents
The information extraction methods described in section 3.1 are based on a direct analysis of the code of HTML documents. We can say that these methods work with the physical realization of the documents. The bottleneck of this approach is the too tight binding of the wrapper to the HTML code. The nature of HTML allows achieving the desired document design in various ways that can be arbitrarily combined, which makes the wrappers limited to a narrow set of documents and a short time period. As an answer to these drawbacks, there have been several attempts to describe the documents from a logical point of view.
The logical structure of a document [2, 13, 58] is basically a model of a document that describes the relations among the logical sections of the document, such as sections, paragraphs, figures, captions, etc. There are generally two approaches to logical document structure discovery:
- Visual document analysis: we analyze the visual aspect of the documents. This approach is applicable to any electronic document format such as PostScript, PDF, HTML, etc. For HTML documents, many approaches to the visual analysis have been published [14, 32, 36, 48, 62].
- Direct analysis of the hierarchy of tags in the document code: we assume that the nesting of the tags corresponds to the logical structure of the document. This approach is applicable to markup languages only. It is more reliable from the complexity point of view, but it is not usable in all cases (for example when the Cascading Style Sheets are used in a certain way), since the hierarchy of tags need not necessarily correspond to the logical structure.
In this thesis, by the notion of logical document structure we understand the logical structure of any document, either a physical or a logical one, depending on the context. The notion of logical document structure was introduced by Summers [58, 59] in the context of processing PDF, PostScript and scanned-in documents, and it is defined as a hierarchy of segments of the document, each of which corresponds to a visually distinguished semantic component of the document. Other authors use the notions of document structure tree [36] or document map [63] in a similar sense.
For some time, it has been assumed that in the case of HTML documents there is no need for modeling the logical structure, because it is directly present in the document in the form of the HTML tags. However, a closer analysis shows that there is only a very loose binding between the HTML tag tree and the logical structure of an HTML document. The reason is that HTML provides both structural and presentational capabilities that can be arbitrarily combined. Furthermore, the effort of the document authors aims at the resulting visual presentation rather than at logically correct HTML code, so many tags are often misused. Thus, creating the logical structure of an HTML document is not trivial either, and it requires a more detailed analysis of the document.
In almost all works published on logical document structure analysis, the resulting model is a tree, where the nodes correspond to the individual logical parts of the document. This model is based on the observation that each HTML document consists of elements that specify information-carrying objects at different levels of abstraction through object nesting. The logical structure can be viewed in such a way that the objects of a higher level of abstraction are described by objects of finer levels of abstraction [14]. Such a hierarchical conception of document organization seems to be natural to the document authors as well as to the readers. This fact has been observed by many authors, for example [1, 4, 14, 20, 49, 58, 60], without any more detailed reasoning. We believe that the main reasons for the hierarchical document organization are:
- It is efficient. Hierarchical organization of a document allows better orientation in the text. It is possible to find the section that deals with a particular topic without having to read the whole document.
- It is feasible. A standard document is linear; it has a beginning and an end. The hierarchical organization can be easily achieved by using various levels of headings and labels. The organization is then apparent to the reader, especially when a table of contents is included. It is not feasible to achieve a more complex organization, such as a general graph, without confusing the reader. The situation is different in the case of logical hypertext documents; this problem is discussed separately in section 5.7.
- It is natural. The hierarchical organization of a structured text has been widely used in technical and popular articles and books. People are used to it. This is actually a consequence of the above two reasons.
For the above reasons, the authors of structured documents mostly prefer the hierarchical organization, and the readers automatically expect it. Therefore, a tree appears to be sufficient for modeling all kinds of documents.
3.3.4 Visual Analysis of HTML Documents
There are several approaches to creating the model of the logical structure for HTML documents. They differ in the granularity of the resulting model, which depends on its intended application.
The work of Carchiolo et al. [13] deals with the discovery of the logical schema of a web site that contains multiple documents. For this purpose, basic logical sections are localized in each document, such as a logical heading, a logical footer and logical data, where the semantics of the page is mainly placed. The proposed approach is based on a bottom-up HTML code analysis. First, collections of similar code patterns are localized in the document. As the second step, each section is assigned a meaning (e.g. logical header) based on the semantics of the HTML tags (e.g. the <form> tag denotes an interactive section) or on some information retrieval techniques (e.g. the header section is the collection that refers to the text in the title of the document or to the URI of the page). Similarly, [62] discovers semantic structures in HTML documents. This approach is based on the observation that in most web pages, the layout styles of subtitles or data records of the same category are consistent, and there are apparent boundaries between different categories. First, the visual similarity of the HTML content objects is measured. Then, a pattern detection algorithm is used to detect frequent patterns of visual similarity. Finally, a hierarchical representation of the document is built. The method described in [48] is based on a similar principle. A key observation of this method is that semantically related items in HTML documents exhibit spatial locality. Again, a tree of HTML tags is built and similar patterns of HTML tags are discovered. Finally, a tree of the discovered structures is built.
While the above methods discover the logical structure of the document to the level of basic semantic blocks, the work of Chung et al. [14] is more oriented towards information extraction. Based on the visual analysis, it attempts to locate data fields in the documents and store them in an XML representation. It is assumed that the documents being processed pertain to a particular, relatively narrow ontology. Furthermore, certain domain knowledge provided by the user is necessary in the form of topic concepts and optional concept constraints. Each concept is described by a set of concept instances that specify the text patterns and keywords as they might occur in topic-specific HTML documents. By contrast, the topic constraints describe how concepts as information-carrying objects can be structured. As the first step, a majority schema of the document is inferred in the form of a document type definition (DTD). Next, the data fields corresponding to the individual concepts are discovered.
The last described approach is quite different. In [32], a method for web content structure discovery is presented that is based on modeling the resulting page layout and locating basic objects in the page through projections and two basic operations: block dividing and merging. The projection allows detecting visual separators that divide the page into smaller blocks, and adjacent blocks can be merged if they are visually similar. This dividing and merging process continues recursively until the layout structure of the whole page is constructed.
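The projection step can be illustrated with a deliberately simplified sketch (not the actual algorithm from [32]): blocks are reduced to vertical (top, bottom) extents, and a horizontal separator is any horizontal strip of the page covered by no block.

```python
# Project every block onto the y-axis and report the uncovered strips,
# which act as horizontal separators between layout blocks.
def horizontal_separators(blocks, page_height):
    covered = [False] * page_height
    for top, bottom in blocks:
        for y in range(top, bottom):
            covered[y] = True
    separators, start = [], None
    for y, filled in enumerate(covered):
        if not filled and start is None:
            start = y                       # a separator strip begins
        elif filled and start is not None:
            separators.append((start, y))   # the strip ends
            start = None
    if start is not None:
        separators.append((start, page_height))
    return separators

# Two blocks with an empty strip between them divide the page in two.
print(horizontal_separators([(0, 40), (60, 100)], page_height=100))  # → [(40, 60)]
```

A full implementation would alternate horizontal and vertical projections and recurse into the resulting blocks, merging visually similar neighbors.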
Chapter 4

Motivation and Goals of the Thesis
There are several causes of these problems. The most important one is the too tight binding of the produced wrapper to the HTML code, which makes the wrapper very sensitive to any minor irregularity in the HTML code. All the above-mentioned wrapper induction methods are based on the assumption that there exists some direct correspondence between the HTML code and its informational content. However, the relevance of this assumption is quite questionable. As stated in the introduction, HTML allows defining the visual appearance of a document, i.e. the presentation and formatting of the contained text. The relation between this formatting and the semantics of the document content is not defined anywhere. Wrappers based on assumptions regarding this relation, defined on the basis of a previous analysis of some documents, are therefore based on information whose relevance is not guaranteed anywhere, and their validity can be (and usually is) limited to a certain set of documents and a certain (unpredictable) time period. Moreover, a document can contain various irregularities and special features, so that the rules induced for one document need not be completely applicable to another document even if it belongs to the same class of documents.
As a solution to the above-mentioned disadvantages of the wrapper approach, we propose developing a new method that fulfills the following requirements:
1. The documents are analyzed at the time of extraction. The method should not be based on any features of the document or the set of documents that have been discovered in the past. Using information that was relevant at some time in the past can lead to incorrect results, because the document may have changed in the meantime.
2. Only the features of the logical document being processed are used for information extraction. Using knowledge about the features of other documents that are considered "similar" to the currently processed one can lead to incorrect results and imprecision.
3. The method must be independent of the physical realization of the document. It must be based on the final document appearance, which must respect certain rules, rather than on the underlying HTML/CSS technology, which can be chosen arbitrarily. This includes the situations when the presented information is split into several physical documents; the whole logical document should always be analyzed.
These requirements remove the necessity of a training set of documents by analyzing each logical document individually. Moreover, this feature should improve the precision of extraction because the method is not based on the extraction rules inferred for other documents. And lastly, the independence of the physical realization of the logical document should significantly reduce the brittleness of inferred wrappers. The goals of our work can be summarized in the following points:
1. To find a suitable model that describes the documents on a sufficient level of abstraction and to define this model in a formal way.
2. To propose a new method of information extraction based on the defined model that fulfills the above requirements and resolves the above-mentioned problems.
3. To evaluate the proposed method experimentally on real WWW documents.
In the following sections, we present the results of our work that have been achieved while fulfilling the goals specified above.
Chapter 5

Visual Information Modeling Approach to Information Extraction
the reader this information. These rules have been established during the evolution of typography long before the first electronic documents appeared. The most common of them are the following:
- The parts of the text that deal with different topics or that have different purposes are visually separated; e.g. the individual articles in a newspaper, footnotes in a book, etc.
- Bold text, italics and underlining are used to stress the significance of a particular part of the text. In modern typography, different colors of the text are sometimes used for highlighting a particular word or sentence.
- The larger the font used for a particular text, the more important this text is. E.g. the most important affairs in newspapers are announced with banner headlines whereas the less important notes are written with a small font size.
- The headings and labels that denote certain information are highlighted using some combination of the above means.
In the case of World Wide Web documents, these visual instruments are used even more intensively than in the traditional media. Since the Internet is usually used for quickly and effectively obtaining some information, the documents must provide a great amount of visual cues that navigate the reader. Most often, these cues have the form of highlighted headings and labels that denote the meaning of each part of the document.
In this chapter, we propose an information extraction method that is based on modeling the visual information in the document that is intended to be used by the readers for quick navigation. First, we define the method for processing a single HTML document. Later, in section 5.7, we extend the method to logical documents formed by multiple HTML documents. As stated in the introduction, in this thesis we focus on the data identification phase of the information extraction process.
5.1 Proposed Approach Overview

The principle of the proposed approach is shown in figure 5.1. In contrast to the traditional wrapper approach (figure 3.2), the information extraction process consists of multiple steps.
As a first step, we analyze the HTML document that contains the data to be extracted and we create a model of the visual information as it is expected to be presented in a standard web browser. This model consists of two separate components: the model of the page layout and the model of the visual features of the text.
As the next step, we transform these two components into a unified model of the document logical structure. This model describes the document content on a significantly more abstract level. As defined in section 3.3.3, the logical structure only describes hierarchical relations among individual parts of the document content. When we represent the text content of the document as a text string (omitting the embedded HTML tags), the resulting model of the logical structure is a tree where each node contains a substring of the text of the document and the edges represent relations between a superior
or more general part of the text (e.g. a heading) and the inferior, more concrete part (e.g. the chapter contents or a data value). A more detailed specification can be found in section 5.5.
After creating the logical structure model, the next step is the information extraction itself. Since the logical structure model is a tree of text strings, the information extraction task can be formulated as the problem of locating a particular tree node that contains the desired information. In our method, we propose defining the information extraction task as a template of a subtree of the logical document structure. The information extraction task then consists of locating all the subtrees of the logical structure model that correspond to the specified template. Each subtree found corresponds to a data record in the extracted data.
In the following sections we give a detailed description of the individual steps.
5.2 Visual Information Modeling

As discussed in the previous sections, the documents in the World Wide Web have a more or less hierarchical organization so that the user can effectively locate the desired information in the document. This hierarchical organization is expressed by two basic means:
1. By splitting the document into several visually separated parts that can be arbitrarily nested; i.e. the page layout.
2. By providing a hierarchy of headings and labels of different levels of abstraction
that describe the contents of a part of the document or the meaning of particular data presented in the document. This hierarchy is expressed by various typographical attributes of the text; e.g. the more important the heading is, the larger the font size used, etc.
In order to obtain the logical structure of a document, we have to create models of both components of the visual information, the page layout and the typographical attributes, and to transform these models into the model of the logical document structure as shown in Figure 5.2.
5.2.1 Modeling the Page Layout

In order to clearly distinguish different kinds of information in the document, web pages are usually split into multiple areas. Figure 5.3 shows a typical example of a document split into several visual areas. We can notice three basic visual areas: the header on the top of the page, the left column with a navigation menu and the main part that carries the informational contents of the page. The main part is further split by a horizontal separator into two parts, the main contents and the footer. We can see that some areas are nested and thus there is a hierarchy of visual areas present in the document.
Generally, the visual areas in a document can be visually expressed using various visual separators (horizontal rules, boxes, etc.) or by different visual properties; the most usual one is a different background color. In HTML, it is not difficult to enumerate all the possible means that can be used for creating a visually separated area in a document. As follows from the HTML specification [54], the set of available means is quite limited. To be specific, such an area can be created using the following HTML constructions:
Figure 5.3: Visual areas in a document
Page object        HTML tags
Document           <html>
Table, table cell  <table> <th> <td>
List, list item    <ul> <ol> <dl> <li> <dt> <dd>
Paragraph          <p>
Generic area       <div>
Frames             <frameset> <frame>
Horizontal rule    <hr>

Table 5.1: Visual areas in HTML and the corresponding tags
For the purpose of modeling the page layout, we define the notion of a visual area as any area in the document that is formed by one of the mentioned means, independently of its visual attributes. For example, we consider each table cell a separate visual area independently of whether the surrounding table cells have a different background color or whether they are separated by a bounding box. The visual areas in a document form a hierarchy where the root represents the whole document and the remaining nodes represent the visual areas in the document, which can be nested. For modeling the page layout, we assign each visual area a unique numeric identifier v_i ∈ I, where the whole document has v_0 = 0 and the remaining areas have v_{i+1} = v_i + 1. Then, the layout of the page can be modeled as a tree of area identifiers as shown in figure 5.4.
Figure 5.4: The page layout modeled as a tree of visual area identifiers v_0, ..., v_6

The page layout model M_l = (V_l, E_l) (5.1) is a tree where (v_i, v_j) ∈ E_l iff v_i and v_j are visual area identifiers and the area identified by v_j is directly nested in the area identified by v_i.
This model doesn't contain any information about the visual attributes of the areas; it only represents the way in which the visual areas are nested. However, by assigning individual parts of the document text to the visual areas, we can obtain information about which parts of the text are related. The model thus represents the most important information the page layout gives to the reader, as discussed above.
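As a minimal illustration, the layout model can be sketched as a mapping from each area identifier to the identifiers directly nested in it; the tree below mirrors the example of figure 5.4, and the concrete child lists are illustrative only:

```python
# A minimal sketch of the page layout model: a tree of visual area
# identifiers (v0 is the whole document; the nesting is illustrative).
layout = {
    0: [1, 5],   # the document contains two top-level areas
    1: [2, 3],   # v1 contains two nested areas
    3: [4],
    5: [6],
}

def nested_areas(tree, root=0):
    """Return all area identifiers nested (directly or not) in `root`."""
    result = []
    for child in tree.get(root, []):
        result.append(child)
        result.extend(nested_areas(tree, child))
    return result

print(nested_areas(layout))  # every area is nested in the document v0
```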
Figure 5.5: Visual attributes of the text

5.2.2 Modeling the Typographical Attributes of the Text
Various visual attributes of certain portions of the text are often used for emphasizing the importance of the particular text portion or, on the contrary, for suppressing its importance. Another common use of the visual attributes is to distinguish the headers and labels in a document from the remaining text. An illustration of this fact is given in figure 5.5. In the text of the first section, the words "weight" and "markedness" are emphasized using italics. Similarly, the word "weight" in the second section is emphasized using a bold font. There are two levels of headings in the document that are distinguished by the font size. The first-level heading gives the title of the document whereas the two second-level headings are used for the titles of the sections. Furthermore, there is a label "Number of cars sold" that introduces the value of "125". The function of the label is indicated by the trailing colon.
For the purpose of automatic document processing, we need to create a model of the documents describing the mentioned features of the text. This model has to describe these features on an abstract level. For example, we want to describe the fact that the word "weight" is emphasized in both occurrences. The means used, i.e. whether the emphasis is achieved using a bold font or italics, is not significant from our point of view. The same effect can be achieved, for example, by typing this word in red color. Similarly, the font size used for the word "Introduction" is not important; we want to describe that this heading has a lower level than the main document heading but a higher level than the remaining text.
An HTML document consists of the text content and the embedded HTML tags. Some of the tags can specify the typographical attributes of the text. Let's denote T_html the set of all possible HTML tags and S an infinite set of all possible text strings. Any document D can then be represented as a string of the form

D = T_1 s_1 T_2 s_2 T_3 s_3 ... T_n s_n T_{n+1}   (5.2)

where s_i ∈ S is a text string with length |s_i| > 0 that doesn't contain any embedded HTML tags.
Since the visual attributes of the text can only be modified by the HTML tags, each text string s_i has constant values of all the attributes. Let's define the notion of a text element as a text string with visual attributes. Each text element e_i ∈ E, where E = S × I × I × I. As usual, we write e_i as

e_i = (s_i, v_i, x_i, w_i)   (5.4)

and we define

e_i < e_j ⟺ i < j, 1 ≤ i, j ≤ n + 1   (5.5)

where s_i is the corresponding text string in (5.2), v_i is the identifier of the visual area the element belongs to and x_i and w_i are the markedness and the weight of the element. v_i is determined while the tree of visual areas is being built. The element e_1 always contains the document title, as discussed further in section 5.4.
The Markedness of a Text Element

The visual appearance of a text element e_i can be determined by interpreting the HTML tags in T_j, j ≤ i, and the corresponding CSS styles (this is basically what the web browsers do). The following visual properties influence the markedness of an element:
- Font weight. Although more degrees of the font weight can be distinguished, from the point of view of a reader we can distinguish two of them: normal and bold. A text element written in bold is always more visually expressive than a normal text with the same attributes.
- Font style. The font style can be normal or italic. It is also possible to use the font style called oblique or slanted; this style is however very similar to italic and is usually not distinguished by the readers, nor even by some web browsers. Writing a text element in italics is often used to reach a higher visual markedness.
- Text decoration. The text element can be underlined in order to increase its markedness or, on the contrary, struck out to decrease its markedness. Overlined elements are usually not used and there is no clear interpretation of overlining.
- Text color. The document text is usually written in one color. Different colors can however be used for highlighting some parts of the text. According to this, any text element written in other than the default color has a higher visual markedness.
Based on the above visual properties, we define the following heuristic for computing the value of the markedness of a text element:

x = (F·f + b + o + u + c)·(1 − z)   (5.7)

where b, o, u, c and z have the value 1 when the text element is bold, oblique, underlined, color-highlighted or struck out respectively, and 0 when it is not. f is the difference between the font size of the element and the default font size f_d for the document. The constant F defines the relation between the text size and its markedness. For F > 4, an element with a greater font size is always more important than an element with a lower font size. This corresponds to the usual interpretation.
Element Weight

Some of the text elements are used as headings or labels that describe the contents of a corresponding section of the document on a higher level of abstraction. The weight of an element describes the position of the element in the hierarchy of headings. The main title of the document has the highest weight while the normal text has a weight of zero. For determining the element weight we analyze the following factors:
- The markedness of the element. The headings are usually written in a larger font or at least in bold. The more expressive the text element is, the higher its weight should be.
- Element position. The position of the element is critical for deciding whether the element is a heading or not. An element that lies in a continuous block of text cannot be considered a heading. For example, the word "weight" in figure 5.5 cannot be considered a heading even if it has a greater markedness than the surrounding text. Generally, we can say that the element must be placed at least at the beginning of a line. An element e_n is placed at the beginning of a line iff any of the tags in T_n causes a line break. Such tags are defined in [54].
- Punctuation. The punctuation can be used for denoting the term-description or property-value pairs of elements. An element that contains text ending with a colon should have a higher weight than the same element without the colon.
Considering these factors, we can define the weight of a text element based on a heuristic similar to the definition of markedness:

w = ((F·f + b + o + u + c)·l + W·p)·(1 − z)   (5.8)

where F, f, b, o, u, c and z have the same meaning as in (5.7), l and p have the value 1 when the element follows a line break and when the element text ends with a colon respectively, and W is the weight of the final colon. An element that ends with a colon should have a higher weight than a possibly following element written in bold, underlined or highlighted by color. This condition holds for W > 4.
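The heuristics (5.7) and (5.8) translate directly into code; a small sketch with F = W = 5, the values used in the example of section 5.4.2 (the attribute encoding follows the definitions above):

```python
def markedness(f, b, o, u, c, z, F=5):
    # (5.7): the font-size difference scaled by F, plus the binary
    # attributes, zeroed out for struck-out text (z = 1)
    return (F * f + b + o + u + c) * (1 - z)

def weight(f, b, o, u, c, z, l, p, F=5, W=5):
    # (5.8): markedness counts only when the element starts a line (l = 1);
    # a trailing colon (p = 1) adds the weight W of the final colon
    return ((F * f + b + o + u + c) * l + W * p) * (1 - z)

# the "Sample document" heading of Table 5.3: f = 8, bold, after a line break
print(markedness(8, 1, 0, 0, 0, 0))      # 41
print(weight(8, 1, 0, 0, 0, 0, 1, 0))    # 41
# the label "Various text follows:": plain text ending with a colon
print(weight(0, 0, 0, 0, 0, 0, 1, 1))    # 5
```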
5.3 Representing the Hypertext Links

As follows from the hypertext nature of the World Wide Web, each document can contain links to other documents. These links don't form part of the text content of the document but they can be specified using the special HTML tag <a>. Therefore, the links are not directly included in the visual information model; they may however be useful for the discovery of the logical documents that is discussed below in section 5.7. For this purpose, it is necessary to represent the included hypertext links in some way.
In HTML, the link is assigned to a continuous portion of the HTML text usually called an anchor. This text is visually distinguished in the document and it is assigned a target URI that points to another document. In our document representation, the text of the document consists of text elements. Thus, we can define the representation of a link as

l = (uri, T)   (5.9)

where uri is the target URI of the link and T is a non-empty set of text elements that form the anchor. Since the anchor must be continuous, T always contains one or more consecutive text elements.
the HTML code of the document. The process of obtaining the necessary data can be split into the following steps:
1. Obtaining the document(s)
2. Tag to CSS conversion
3. Visual area and text element identification
4. Unit unification
5. Computation of the markedness and weight
The way in which the documents are obtained is not significant for the document modeling. Usually, the documents are retrieved through a network using the HTTP protocol. For the analysis, we have to retrieve the HTML document itself and all the eventual style sheets referenced in the document. When frames are used, it is also necessary to obtain the code of all the frames.
As the next step, we go through the HTML code and we convert all the tags that modify the values of the visual attributes to CSS specifications. For example, we remove all the <b> and </b> pairs from the document code and we replace them with <span style="font-weight: bold">. Similarly, some visual attributes can be influenced by multiple CSS properties or by different specification methods (absolute or relative value, etc.). In this phase, we convert the specification using HTML tags and CSS properties to a single CSS property for each visual attribute. For each tag encountered, we also check the document-wide style specification that may also be influenced by the id or class attributes of the tag. The resulting CSS style definitions are inserted back into the document using the style attribute of the generic <span> and <div> tags. Various methods and units of the value specification are unified later, in the unit unification phase. Table 5.2 shows a list of all HTML tags and the corresponding CSS properties that influence the values of individual visual attributes of the text. We convert the specifications as follows:
- Headings are converted to a text style with the font-weight: property set to bold and the font-size: property set according to the level of the heading. According to the standard behaviour of web browsers, the headings of levels 1 to 6 have the font size set to xx-large, x-large, large, medium, small and x-small respectively.
- Table headers specified using the <th> tag have the font-weight: property set to bold.
- Links, i.e. the text that is used as a hypertext link to other documents, have the text-decoration: property set to underline.
- Font specifications using the <font> tag can be used for modifying the font face, size and color for a part of the text. The font size definition is converted to the font-size: property specification in CSS and the color attribute is converted to the color: property. We don't deal with the font face in our method.
Attribute   CSS properties     HTML tags
Size        font-size:,        <big>, <font>, <h1> - <h6>, <small>
            line-height:
Weight      font-weight:       <b>, <h1> - <h6>, <th>, <strong>
Style       font-style:        <em>, <i>
Decoration  text-decoration:   <s>, <strike>, <u>
Color       color:             <font>

Table 5.2: Text Visual Attributes
- Bold text specified using the <b> and <strong> tags is converted to the font-weight: bold specification.
- Italics specified using the <i> and <em> tags are converted to font-style: italic.
- Underlined and struck-out text is specified using the text-decoration: property set to underline or line-through respectively.
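As an illustration of this conversion phase, the sketch below rewrites a few presentational tags into styled <span> elements; the mapping covers only a handful of the tags from Table 5.2, and the simple string replacement is a deliberate simplification of the single-pass parsing described later:

```python
# A minimal sketch of the tag-to-CSS conversion: presentational tags are
# replaced by generic <span> elements with an equivalent inline style.
TAG_TO_STYLE = {
    "b": "font-weight: bold",
    "strong": "font-weight: bold",
    "i": "font-style: italic",
    "em": "font-style: italic",
    "u": "text-decoration: underline",
    "s": "text-decoration: line-through",
    "strike": "text-decoration: line-through",
}

def normalize(html):
    for tag, style in TAG_TO_STYLE.items():
        html = html.replace("<%s>" % tag, '<span style="%s">' % style)
        html = html.replace("</%s>" % tag, "</span>")
    return html

print(normalize("Various text follows: <b>bold</b>, <i>italics</i>"))
```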
The visual properties of some page objects such as headings and table headers are not exactly defined in the HTML or CSS specification. The resulting visual appearance of these elements depends to a certain extent on the web browser used and its configuration. The conversion rules listed above have been observed in the popular web browsers Mozilla (http://www.mozilla.org) version 1.6 and Konqueror (http://konqueror.kde.org) version 3.2.2. Especially the first one is commonly considered to implement all the web standards and recommendations most precisely. Anyway, since a certain variety in interpreting the HTML code by different web browsers shouldn't influence the document understanding by the user, the mentioned rules present one of the possible variants that is however acceptable and a de facto standard.
The next phase consists of discovering the visual areas and the text elements in the document. These two tasks are done simultaneously during a single-pass analysis of the HTML code. The output of this phase is:
1. The tree M_l of visual area identifiers v_n that corresponds to the definition (5.1).
2. The sequence M_s of styled elements.
A styled element is basically a text string with an assigned visual area identifier and a style. The style can be represented as a tuple

style_i = (fsize, fweight, fstyle, fdecoration, color)   (5.11)

where the individual elements correspond to the appropriate CSS properties. Then, a styled element can be defined as

es_i = (s_i, v_i, style_i)   (5.12)
where s_i is the text contents of the element, v_i is the identifier of the assigned visual area and style_i is the style of the element.
During the HTML code analysis, we subsequently read the HTML tags from the code of the document. For creating the tree of visual area identifiers M_l, we maintain a stack S_v of the currently open visual areas. At the beginning, S_v contains the identifier v_0 of the root area and M_l contains only v_0 as the root element. When a tag is read that implies the start of a new area (any tag from table 5.1), a new area is opened and assigned a new identifier v_i. This identifier is added to M_l as a son node of the identifier currently on the top of S_v and afterwards, v_i is added to the top of S_v. When the corresponding closing tag is encountered, the area identifier on the top of the stack is removed from the stack.
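The stack-driven construction of M_l can be sketched with Python's html.parser; the set of area-forming tags below is a subset of Table 5.1, and the simplified end-tag handling (popping on any area tag) is our own shortcut:

```python
from html.parser import HTMLParser

AREA_TAGS = {"html", "table", "th", "td", "ul", "ol", "li", "p", "div"}

class AreaTreeBuilder(HTMLParser):
    """Builds the tree M_l of visual area identifiers using a stack S_v."""
    def __init__(self):
        super().__init__()
        self.next_id = 1
        self.tree = {0: []}   # M_l: v0, the whole document, is the root
        self.stack = [0]      # S_v: the currently open visual areas

    def handle_starttag(self, tag, attrs):
        if tag in AREA_TAGS:
            vid = self.next_id
            self.next_id += 1
            self.tree.setdefault(vid, [])
            self.tree[self.stack[-1]].append(vid)  # son of the top of S_v
            self.stack.append(vid)

    def handle_endtag(self, tag):
        # simplification: any closing area tag pops the topmost open area
        if tag in AREA_TAGS and len(self.stack) > 1:
            self.stack.pop()

builder = AreaTreeBuilder()
builder.feed("<div><p>a</p><table><td>b</td><td>c</td></table></div>")
print(builder.tree)
```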
For determining the visual attributes of the elements we maintain a stack S_s of the styles. At the beginning, the stack contains a default style. When any opening tag is encountered in the code, the top of the stack is duplicated. The style on the top of the stack is then updated according to the available style definitions for the tag being processed. When the corresponding closing tag is encountered, the top of the stack S_s is removed. The style of the element can be specified by any CSS specification (see the overview in section 2.2). This specification is particularly important for the hypertext links, which are usually underlined by default but any other visual effect can be specified using CSS definitions for the <a> tags.
Any non-empty text string encountered between two subsequent tags forms a new styled element. The first styled element es_1 is always formed by the document title specified by the <title> tag. If no title is specified, the element contains an empty string s_1. The weight of the title element should always be greater than the weight of any other text element in the page. Each new element is assigned the identifier of the visual area that is currently on the top of the stack of visual areas S_v and the style that is currently on the top of the stack S_s. Then, the new styled element is appended to M_s.
In the unit unification phase, the values of the size and the color in the styles of all the styled elements from M_s are converted to a unified form, since in CSS the font size and the color can be specified in various units and notations. The final model of the text elements M_t is then created from M_s by computing the values of markedness and weight for all the styled elements. For each styled element es_i ∈ M_s we create a new text element e_i by copying the values of s_i and v_i and computing the value of the markedness x_i according to the definition (5.7) and the weight w_i according to the definition (5.8). Then, the new element e_i is appended to M_t.
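As an illustration of the unit unification, the following sketch converts a few common CSS font-size notations to a single unit (pixels); the keyword table follows common browser defaults and is an assumption, not a value prescribed by this work:

```python
# A sketch of the unit unification phase: different CSS font-size
# notations are converted to one unit (pixels at an assumed 96 dpi).
DEFAULT_PX = 16
KEYWORDS = {"xx-small": 9, "x-small": 10, "small": 13, "medium": 16,
            "large": 18, "x-large": 24, "xx-large": 32}

def font_size_px(value, parent_px=DEFAULT_PX):
    value = value.strip()
    if value in KEYWORDS:
        return KEYWORDS[value]
    if value.endswith("px"):
        return float(value[:-2])
    if value.endswith("pt"):
        return float(value[:-2]) * 96 / 72       # 1 pt = 1/72 in
    if value.endswith("em"):
        return float(value[:-2]) * parent_px     # relative to the parent
    if value.endswith("%"):
        return float(value[:-1]) / 100 * parent_px
    return parent_px                             # fallback: inherit

print(font_size_px("12pt"))   # 16.0
print(font_size_px("150%"))   # 24.0
```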
5.4.1 Tables in HTML

The table header cells can be denoted using the <th> or <thead> tags; this feature is however not commonly used and in some cases it is used incorrectly. Therefore, we propose determining the headers by comparing the visual appearance of the table cells:
- We assume that the header is formed by the top row or the leftmost column of the table.
- All the cells in the header must have a consistent visual style: font, color and background.
When a header row or column is encountered, we add each header cell to the tree of visual areas as a child node of the table visual area. After this, we consider each part of the table that is covered by a single header cell as a separate subtable, formed by one or more rows (when the header cells are in the first column) or columns (headers in the first row), and we repeat the header identification process recursively on all the subtables. The process ends when no headers can be identified in the subtables. In this case, we assume that the remaining cells only contain the data values.
At the end of this process we obtain a hierarchy of headers. In the next step, we add all the cells of the remaining subtables to the visual model as the child nodes of the respective header cell they belong to. An illustration of this process is in figure 5.8. In steps 1 and 2 we detect the headers of the (sub)tables. In step 3, no more headers can be detected and the data cells are just included in the tree.
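The recursive header identification can be sketched as follows; a table is represented as a grid of (text, style) pairs, only the top-row case is handled, and the "one consistent style that differs from the rest" test stands in for the visual comparison described above:

```python
def detect_headers(grid):
    """Return (header_text, children) pairs for a grid of (text, style) cells.
    Simplified sketch: the top row is taken as the header when all its cells
    share one style that differs from the style of every remaining cell."""
    if not grid or not grid[0]:
        return []
    top = grid[0]
    rest = [cell for row in grid[1:] for cell in row]
    top_styles = {style for _, style in top}
    if len(top_styles) == 1 and rest and all(s not in top_styles for _, s in rest):
        # each header cell covers one column of the remaining subtable;
        # recurse on that subtable until no more headers are found
        return [(text, detect_headers([[row[i]] for row in grid[1:]]))
                for i, (text, _) in enumerate(top)]
    return [(text, []) for row in grid for text, _ in row]  # data cells only

table = [[("Name", "bold"), ("E-mail", "bold")],
         [("John Smith", "plain"), ("john@johnsmith.com", "plain")]]
print(detect_headers(table))
```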
5.4.2 Example of Visual Models

In order to demonstrate the HTML parsing process, let's consider the simple document shown in figure 5.9. The HTML tags that influence the visual appearance of the text elements are shown in the shaded boxes.
The root visual area is formed by the document. The only object that introduces sub-areas to the document is a simple table. In this table, a structure can be detected. The resulting tree of visual areas is shown in figure 5.10.
Table 5.3 shows the visual attributes of all the text elements in the page and the values of markedness and weight computed from the definitions (5.7) and (5.8) for F = 5 and W = 5. As stated in section 5.2.2, the first element e_1 is always the title of the document.
Figure 5.8: The recursive detection of the table headers (steps 1 to 3)
corresponding values in each line of the table. Note that the visual area identifier v_i of each element corresponds to the tree of visual areas in figure 5.10. The superiority relations among the text elements are derived from the interpretation of their visual attributes and their position in the document. In order to ensure that S is a tree, it must hold that each text element except e_1, which is the page title, has exactly one directly superior element (its parent element in the tree). The element e_1 always forms the root node of the tree.
Figure 5.9: Sample document
The set of edges E_S of the tree is derived from the page layout model M_l (5.1) and the model of the typographical attributes of the text M_t (5.6) by an algorithm that consists of two steps; the second step changes the parent of an element or of some of its descendants if needed, so that an element of a higher weight is always an ancestor of all the elements with a lower weight within the visual area.

s                       f  b  o  u  c  z  l  p  v  x   w
Structure               0  0  0  0  0  0  0  0  0  0   0
Sample document         8  1  0  0  0  0  1  0  0  41  41
This is the beginning.  0  0  0  0  0  0  1  0  0  0   0
Text                    5  1  0  0  0  0  1  0  0  26  26
Various text follows:   0  0  0  0  0  0  1  1  0  0   5
bold                    0  1  0  0  0  0  0  0  0  1   0
italics                 0  0  1  0  0  0  0  0  0  1   0
and underlined          0  0  1  1  0  0  0  0  0  2   0
There can also be       0  0  0  0  0  0  1  0  0  0   0
some link               0  1  0  0  0  0  0  0  0  1   0
.                       0  0  0  0  0  0  0  0  0  0   0
Table                   5  1  0  0  0  0  1  0  0  26  26
Some tabular data:      0  0  0  0  0  0  1  1  0  0   5
Name                    0  1  0  0  0  0  1  0  4  1   1
John Smith              0  0  0  0  0  0  1  0  6  0   0
E-mail                  0  1  0  0  0  0  1  0  7  1   1
john@johnsmith.com      0  0  0  1  0  0  1  0  9  1   1

Table 5.3: Text elements in the sample document

The algorithms for both steps follow.
Algorithm 1 Creating a frame of the logical structure
Input: V_S = {e_1, e_2, ..., e_n} -- the set of text elements
current = e_1
for each e_i = e_2, e_3, ..., e_n; e_i = (s_i, v_i, x_i, w_i) do
    if v_i ≠ v_{i-1} then
        current = e_{i-1}
    else
        current = the nearest e_j such that
            e_j is an ancestor of e_{i-1} in S_V and
            v_j is an ancestor of v_i in M_l
    Add (current, e_i) to E_V
Figure 5.11: Sample document with the corresponding logical structure
Algorithm 2 Ordering the elements according to their weights
Input: e_p ∈ V_S is a root of S_V; e_1, e_2, ..., e_n ∈ V_S are the child nodes of e_p
Output: S = (V_S, E_S)
if n > 0 then
    current = e_p
    Add (e_p, e_1) to E_S
    for each e_i = e_2, e_3, ..., e_n; e_i = (s_i, v_i, x_i, w_i) do
        if w_i < w_{i-1} then
            current = e_{i-1}
        if w_i > w_{i-1} then
            current = the nearest e_j such that
                e_j is an ancestor of e_{i-1} and
                e_p is an ancestor of e_j and
                w_j > w_i
        Add (current, e_i) to E_S
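Algorithm 2 can be transcribed almost literally; the sketch below keeps the open ancestors on a stack instead of searching the partially built tree for the nearest suitable e_j, and the element names and weights are taken from Table 5.3:

```python
def order_by_weight(parent, elements):
    """A sketch of Algorithm 2: nest the child elements of `parent` so that
    an element of a higher weight is always an ancestor of the lower-weight
    elements that follow it. `elements` is a list of (name, weight) pairs;
    returns the edge list E_S as (parent, child) pairs."""
    edges = []
    chain = [(float("inf"), parent)]   # open ancestors, parent at the bottom
    for name, w in elements:
        # climb to the nearest ancestor e_j with w_j > w_i
        while chain[-1][0] <= w:
            chain.pop()
        edges.append((chain[-1][1], name))
        chain.append((w, name))
    return edges

elements = [("Text", 26), ("Various text follows:", 5), ("bold", 0),
            ("Table", 26), ("Some tabular data:", 5)]
print(order_by_weight("Sample document", elements))
```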
The resulting model of the logical structure exhibits an important feature that can be described as follows. Let text be the whole text of an HTML document, obtained by taking the contents of the <body> tag of the document and discarding all the embedded HTML tags. Let e_1, e_2, ..., e_n be the sequence of text elements obtained by a pre-order traversal of the resulting tree S and let s_i be the text string contained in the text element e_i. Then, it holds that

text = ∏_{i=2}^{n} s_i   (5.14)

where the product denotes string concatenation. In other words, when concatenating all the text strings except the first one (the title) in the logical document structure during the pre-order traversal of the tree, we obtain the complete text of the document.
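Property (5.14) is easy to check mechanically; in the toy tree below (keyed by element text, which is unambiguous here), a pre-order traversal that skips the title in the root reproduces the document text:

```python
def preorder_text(tree, node, skip_root=True):
    """Concatenate the element texts in pre-order, per property (5.14).
    `tree` maps a node's text to the list of its children's texts."""
    out = "" if skip_root else node   # the root holds the title and is skipped
    for child in tree.get(node, []):
        out += preorder_text(tree, child, skip_root=False)
    return out

tree = {"Title": ["Heading ", "Tail."], "Heading ": ["body text. "]}
print(preorder_text(tree, "Title"))  # "Heading body text. Tail."
```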
5.6 Information Extraction from the Logical Model

As defined in the previous section, the logical structure of a document is a tree whose nodes are formed by all the text elements contained in the document. When using the logical structure model for information extraction, the task is to locate the nodes of the tree that contain the desired information. We assume that the document is designed to be understandable by a user and we analyze the way in which the visual information is used by the user when looking for particular data in the document.
The first assumption is that the user usually knows an approximate form of the information he is looking for. For example, when looking for a price, we are trying to find a number preceded or followed by a currency code. The second assumption is that there is a hierarchy of headings and labels provided by the author of the document that allows the user to interpret the document contents effectively. The user is thus provided with three types of navigational cues:
1. The relations among the text elements in the document that correspond to the logical document structure and that are expressed using the visual design of the document.
2. The labels provided by the author of the document.
3. The expected format of the desired information.
When looking for particular information, the user typically navigates through the logical structure of the document as suggested by the visual means and tries to choose the path leading to the desired information by classifying the headings and labels that in our case correspond to the nodes of the structure tree. A normal user can use natural language understanding for choosing the best path. Since natural language processing is a non-trivial task, we propose a simplified solution based on regular expressions and approximate tree matching algorithms.
Let's return to the original idea of information extraction from data-intensive documents, which is based on the extraction of complete data records rather than on extracting single data values. Further, we will assume that each data value, when it is present in the document, forms exactly one text element. This assumption is somewhat simplifying because it doesn't allow the situations when the information is split into several text elements or, on the contrary, when there is additional text present in the same text element. However, as follows from the text element definition (5.4) on page 38, this assumption corresponds to the following situation:
- The data value is visually consistent (thus it forms a single text element).
- The data value is visually separated from the surrounding text by an HTML tag so that the surrounding text doesn't form part of the text element.
In the data-intensive documents, the acceptance of these rules can be expected. The impact of this simplification on the performance of the information extraction method is discussed in the evaluation part of this thesis.
Let's assume that each data record r to be extracted consists of |r| data values (an example of such a data record is given on page 3). When we admit that some data values can be missing in the document, then, when extracting the data record, the task is to locate the m text elements in the logical document structure that correspond to the data values present, where m is the minimal number of data values that must be located for considering the data record to be found. All the located text elements that form a single data record are located in a certain subtree of the logical structure tree. Aside from the data values, this subtree contains the text elements that correspond to the labels that denote the values in the documents and some remaining text elements that can contain additional notes etc.
From this perspective, the information extraction task can be viewed as a task of locating the subtrees of the logical structure tree that meet the following requirements:

- The subtree contains the data values of the expected form and the expected labels.

- The data values are logically related to the appropriate labels; i.e. the text element containing the potential data value is a descendant of an element containing the appropriate label.

For locating the appropriate subtrees of the logical structure tree, we propose an approach based on tree matching algorithms.
5.6.1 Using Tree Matching

The proposed approach is based on the specification of a template of the subtree that is to be located. This specification consists of two steps:

1. We specify the expected logical structure of the extracted data record.

2. We add the information about the expected labels and the format of the data values. Regular expressions are used for this specification.

The result of this specification is a template of a tree that can be interpreted as a structured query to the database of all the subtrees of the logical structure tree.
Let's consider a simple example in the Figure 5.12. From a personal page, we want to extract the name, department and e-mail address of that person. The left tree defines an expected logical structure of the extracted information. In case of the personal pages, the name of the person is usually presented as a superior text (heading) and the remaining data is placed further in the document. We extend the tree by replacing the fields with the regular expressions that denote their expected format and we add the expressions that denote the expected labels of these data. The resulting tree can be viewed as a structured query and tree matching algorithms can be used for identifying all matching subtrees of the logical document structure.

Figure 5.12: The expected logical structure (left) and the resulting template tree with the regular expressions ^[a-zA-Z\ \.]+$ (Name), [Dd]epartment, [Ee]-?mail and ^[A-Za-z0-9_\.]+@[A-Za-z0-9_\.]+$ (E-mail)
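The node tests of such a template reduce to full-match evaluation of regular expressions over the text content of the elements. A minimal sketch in Java (the language of our experimental system); the class and method names are illustrative, not the actual implementation:

```java
import java.util.regex.Pattern;

// A template tree node carrying the regular expression that a text
// element must match (full match), as in the Figure 5.12 example.
public class TemplateNode {
    final Pattern pattern;

    TemplateNode(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    boolean matches(String textElement) {
        return pattern.matcher(textElement).matches();
    }

    public static void main(String[] args) {
        TemplateNode name  = new TemplateNode("^[a-zA-Z\\ \\.]+$");
        TemplateNode label = new TemplateNode("[Ee]-?mail");
        TemplateNode email = new TemplateNode("^[A-Za-z0-9_\\.]+@[A-Za-z0-9_\\.]+$");
        System.out.println(name.matches("John Smith"));      // true
        System.out.println(label.matches("E-mail"));         // true
        System.out.println(email.matches("john@smith.org")); // true
        System.out.println(email.matches("no address"));     // false
    }
}
```

Note that matches() requires the whole text element to fit the expression, which corresponds to the assumption above that a data value forms exactly one text element.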
For simple extraction tasks, we can simply estimate the logical structure of the data records as shown in the previous example. For more complex tasks, it is necessary to obtain at least one sample document in order to determine the logical structure of the contained data records. Furthermore, there usually exist several variants of structuring the data records. For example, the previous example shown in the figure 5.12 could be changed so that the name is presented at the same level as the remaining data. For this reason, it may be necessary to modify the expected data structure when processing a new set of documents. By adapting the extraction task for further sets of documents, we obtain several variants of the template subtree that cover all possible variants of presenting the extracted data. However, in comparison to wrappers, where the number of possible variants is extremely high and therefore the wrappers must be re-generated for each set of documents, in our case there are only a few variants of the logical structure and, with the increasing number of processed data sets, the frequency of the modifications decreases rapidly. For example, for the sample task shown in the figure 5.12, two variants of the task specification are sufficient as shown in the method evaluation part in section 7.1.1.
5.6.2 Approximate Unordered Tree Matching Algorithm

For tree matching, we use a modification of the pathfix algorithm published by Shasha et al. [56]. This algorithm solves approximate searching in unordered trees based on root-to-leaf path matching. The original problem is to find all data trees from a database D that contain a substructure which approximately matches a query tree Q, where the distance is given by the number of root-to-leaf paths from Q that do not appear in the data tree. Let's consider the example of the query tree Q in Figure 5.13, which has two paths. The query tree matches the data tree D1 with a distance of 0 (both paths from Q are present in D1). For the tree D2, there are two possible matches with a distance of 1 (the path A-D is not present in the tree but the path A-B has two occurrences in D2).

Figure 5.13: An example of a query tree Q and two data trees D1 and D2

In order to use this algorithm for our purpose, we introduced the following modifications:
- Instead of the database of trees, we have only one tree (the logical structure tree S) and we want to locate the subtrees (data records) that match the query tree. Thus in our case, the database D is formed by all the subtrees of the logical document structure S with a root node that matches the root node of the query tree.

- The values in the nodes of the trees don't have to match exactly since the query tree Q contains regular expressions in its nodes. We define that a node in the logical structure tree S matches a node in the query tree Q iff the appropriate text element matches the regular expression in the query tree node.

- Since the exact form of the logical document structure may differ to some extent among the web sites, we have extended the path matching algorithm by introducing two more parameters:

  - QSKIP: the number of nodes from the query path that don't match any node from the matched path in S. This corresponds to the situation that some labels or data values that are expected by the tree template are missing in the document.

  - MSKIP: the number of nodes in the matched path from S that don't match any node in the query path. This corresponds to the situation that there are some extra nodes in the logical structure tree that are not expected by the tree template.

These parameters allow us to consider the paths as matching even if there is a certain number of missing or excessive nodes.
The accuracy of the results depends on the values of the QSKIP and MSKIP parameters and on the allowed number of missing paths DIFF. A more detailed analysis of the influence of the parameters on the precision and recall of the method is given in Chapter 7.
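To illustrate how the two parameters act on a single pair of root-to-leaf paths, the following sketch simplifies the node test to exact string equality (in the real method, a regular expression match). An alignment that matches c node pairs leaves |query| - c query nodes and |data| - c data nodes unmatched, and both counts are minimized at once by maximizing c, i.e. by a longest common subsequence of the two label sequences:

```java
// A simplified illustration of the extended root-to-leaf path matching:
// a data path matches a query path if some alignment leaves at most
// QSKIP unmatched query nodes and at most MSKIP unmatched data nodes.
public class PathMatch {

    // Length of the longest common subsequence of the two label sequences.
    static int lcs(String[] a, String[] b) {
        int[][] dp = new int[a.length + 1][b.length + 1];
        for (int i = 1; i <= a.length; i++)
            for (int j = 1; j <= b.length; j++)
                dp[i][j] = a[i - 1].equals(b[j - 1])
                        ? dp[i - 1][j - 1] + 1
                        : Math.max(dp[i - 1][j], dp[i][j - 1]);
        return dp[a.length][b.length];
    }

    static boolean matches(String[] query, String[] data, int qskip, int mskip) {
        int c = lcs(query, data);
        return query.length - c <= qskip && data.length - c <= mskip;
    }

    public static void main(String[] args) {
        String[] q = {"A", "B"};
        System.out.println(matches(q, new String[]{"A", "C", "B"}, 0, 1)); // true
        System.out.println(matches(q, new String[]{"A", "C"},      0, 1)); // false
    }
}
```

With QSKIP = 0 and MSKIP = 1, one extra node on the data path is tolerated while every query node must be matched.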
5.7 Information Extraction from Logical Documents

Analogous to the physical documents, the logical documents also exhibit a hierarchical organization for similar reasons (see the discussion on page 25, section 3.3.3). In contrast to the physical documents, the hypertext allows giving the logical document any organization that can be modeled using a directed graph. On the other hand, the organization must be easily understandable to the document users. From the practical point of view, considering this requirement, it is not advisable to use a more complex organization than the hierarchical one. As observed by [60], when we extract only the links intended to be the routes through which the readers go forward within a document, in most cases we obtain a sequence or a hierarchy of pages.
Let's represent the logical document D as a tuple

    D = (Dp, I)    (5.15)

where Dp is a set

    Dp = {d1, d2, ..., dn}    (5.16)

where di, 1 ≤ i ≤ n, are the physical documents and I ∈ Dp is the main (index) page. In the context of using the information extraction method described in the above chapters, the physical document di can be defined as a tuple

    di = (urii, Si, Li)    (5.17)

where urii is the uniform resource identifier of the physical document, Si is the model of the logical structure as defined on page 46 and Li is the set of links defined on page 40.
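The tuples (5.15)-(5.17) map directly onto data types. A sketch in Java, with the structure model Si and the links Li reduced to placeholders (the type names are our own, not taken from the experimental system):

```java
import java.util.List;
import java.util.Set;

// The tuples (5.15)-(5.17) as data types; the structure model Si and
// the link set Li are abstracted away in this sketch.
public class LogicalModel {
    record Link(String url) {}
    record PhysicalDocument(String uri, Object structure, Set<Link> links) {}
    record LogicalDocument(List<PhysicalDocument> pages, PhysicalDocument index) {}

    public static void main(String[] args) {
        PhysicalDocument index =
                new PhysicalDocument("http://www.example.org/index.html", null, Set.of());
        LogicalDocument d = new LogicalDocument(List.of(index), index);
        System.out.println(d.index().uri());
    }
}
```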
The extension of our information extraction method from physical to logical documents is straightforward. We assume that the user specifies the URI of a document when invoking the information extraction process. For discovering the logical documents, we adopt the method published in [60], which is also discussed in section 3.3.2. The information extraction process consists of the following steps:
- We consider the document whose URI has been specified by the user the index page I of the logical document. We use the method [60] for obtaining the URIs of the remaining physical documents that form the logical document.

- We create the models Si of the logical structure for each physical document di.

- We create a single model S of the logical structure for the whole logical document D by joining the logical structure trees Si of the physical documents to a single tree.
54
When joining the logi
al stru
ture trees to the global model of the logi
al stru
-
Si
Let the root node of (the logi
al stru
ture tree of the index page) be the root
I
node of . S
( ). For ea
h link = (
urij ; Sj ; Lj l ); 2 we sele
t the text element 2
url; T l Lj e T
that pre
edes all remaining elements
ontained in (see the denition (5.4) on
T
the page 38). If the of the link
orresponds to the URI of any do
ument
uri
Chapter 6

Experimental System Implementation

In order to evaluate the proposed information extraction method in the real World Wide Web environment, we have implemented an experimental system for extracting information from the data-intensive documents in the web.
6.1 System Architecture Overview

The general architecture of our information extraction system is shown in the figure 6.1. The system consists of four basic modules:

- Interface module implements an interface between the system and the Internet. It is responsible for downloading the necessary HTML documents and storing them for further analysis.

- Logical document module implements the discovery of logical documents. Given a URI of the main page, it analyzes the hypertext links and creates a list of URIs of the documents that form the logical document.

- Analysis module provides the analysis of the HTML code in order to create the model of the visual information. Further, it implements the transformation of the visual information model to the logical structure model. Finally, it provides unification of the logical structure models to the unified tree of the logical structure.

- Extraction module implements the information extraction from the logical structure model using the tree matching algorithms. The input is the user-specified extraction template and the output is the extracted data.
In order to reach maximal flexibility of the system and in order to be able to evaluate the individual parts separately, the modules are implemented as standalone pieces of software that communicate with each other. The communication with the interface module is simple: the module works as an HTTP proxy, so that the communication is implemented using the HTTP protocol. The communication among the remaining modules is based on the use of XML. This solution allows us to achieve maximal interoperability of the modules: each module can be used separately and the produced output can be used for any task. This is particularly useful for representing the logical document structure, which can be used not only for information extraction but also for other tasks where the existing XML processing technologies such as XSLT or XQuery can be used. For this reason, we now describe the use of XML in our system in more detail.

Figure 6.1: General architecture of the system: the interface module, the logical document module (logical document discovery), the analysis module (HTML parser, visual information analyzer, logical structure analyzer) and the extraction module

Figure 6.2: An example XML representation of a logical document:

<?xml version="1.0" encoding="iso-8859-2"?>
<!DOCTYPE logical_documents SYSTEM "ld.dtd">
<logical_document url="http://www.xyz.cz/">
  <document url="http://www.xyz.cz/index.php" />
  <document url="http://www.xyz.cz/aboutus.php" />
  <document url="http://www.xyz.cz/contact.php" />
</logical_document>
6.2 Using XML for Module Communication

The XML-based communication is used in two points of the system. First, XML is used for transferring the URI lists from the logical document module to the analysis module. Secondly, the most important use of XML is the representation of the global model of the logical structure. Furthermore, XML is also used for the extraction task specification. This application of XML is discussed below in the section 6.3.4. Formal definitions of the used XML formats are given in appendix B.
6.2.1 Representing the Logical Documents

The data to be represented is formed by a simple list of URIs of the physical documents that form the logical document. We represent each physical document by a <document> tag with an attribute uri that contains the URI of the physical document. The root tag <logical_document> encloses the list of physical documents. An example XML representation of a logical document is shown in the figure 6.2. The full DTD (Document Type Definition) of the logical document XML representation can be found in appendix B.2.
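The format is simple enough to be consumed with the standard DOM parser on the receiving side. A sketch (the class is our illustration; the DOCTYPE is omitted so that the fragment parses standalone):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Reads the list of physical document URIs from the XML representation
// of a logical document (the url attribute, as in the example above).
public class UriList {
    static List<String> parse(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        List<String> uris = new ArrayList<>();
        NodeList docs = doc.getElementsByTagName("document");
        for (int i = 0; i < docs.getLength(); i++)
            uris.add(((Element) docs.item(i)).getAttribute("url"));
        return uris;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<logical_document url=\"http://www.xyz.cz/\">"
                   + "<document url=\"http://www.xyz.cz/index.php\"/>"
                   + "<document url=\"http://www.xyz.cz/aboutus.php\"/>"
                   + "</logical_document>";
        System.out.println(parse(xml));
    }
}
```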
6.2.2 Logical Structure Representation

Since the logical document model is a tree of text elements, the XML representation is straightforward. Each text element e = (s, v, x, w) is represented by a single XML tag <text> with attributes that represent the individual values of the tuple: the content attribute carries the text content s of the element and the attributes visual, expr and weight contain the values of v, x and w respectively. Furthermore, a <document> tag with the title and url attributes encloses the whole structure tree, as in the following example:

<document title="Sample" url="http://sample.org">
  <text content="Personal data" visual="2" expr="3" weight="3">
    <text content="Name" ...>
      <text content="John Smith" .../>
    </text>
    <text content="E-mail" ...>
      <text content="johnjohnsmith.org" .../>
    </text>
  </text>
</document>
6.3.1 Interface Module

This module provides the interface between the system and the Internet. It works as an HTTP proxy; the communication with the remaining modules is implemented using the HTTP protocol [24]. Each module sends a request for a document with a certain URI. The interface module downloads the document and stores it locally for later use. Then it generates an HTTP reply containing the code of the document. The stored documents are valid for a single information extraction task only. For improving the reliability of the downloading process, any publicly available general HTTP cache can be used, as for example Squid (http://www.squid-cache.org).
6.3.2 Logical Document Module

When provided with a URI, this module analyzes the structure of links leading from this document and discovers the boundaries of the documents. As the result, it produces the list of the URIs of physical documents.
6.3.3 Analysis Module

This module implements the methods of HTML code parsing proposed in the section 5.4, visual information modeling in documents that has been proposed in section 5.2 and transforming this model to the logical structure model as proposed in section 5.5. The resulting model of the logical structure is represented using an XML document as described in section 6.2.2.
Figure 6.5: Analysis module

6.3.4 Extraction Module

The extraction module implements the tree matching algorithm defined in section 5.6. Given the logical structure of the document obtained from the analysis module, it attempts to locate subtrees of the logical structure that correspond to the extraction task specification.
The task specification corresponds to a template of a subtree as shown in the example in the figure 5.12 (page 52). Again, we use XML for defining the information extraction task. The XML specification has the following format:

<task name="...">
  <input>
    <!-- Element format specifications -->
  </input>
  <model name="...">
    <!-- data and label hierarchy spec. -->
  </model>
</task>
The whole specification is enclosed in the <task> tag and it is assigned a unique name. The specification consists of two parts. In the <input> part, the formats of text elements are specified. Each format specification consists of a name that identifies the format and a regular expression that specifies the format itself. Additionally, it can be specified if the regular expression is compared in a case-sensitive or case-insensitive manner. For example, the format for a name of a department can be specified as

<spec case="lower" name="dept">^[a-z0-9\ ]+$</spec>
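A format specification with case="lower" can be honored by lower-casing the candidate text before the full match. A sketch (the helper class is our illustration, not the system's code):

```java
import java.util.regex.Pattern;

// A named format specification as in the <spec> element: a regular
// expression plus a flag requesting lower-casing before the match.
public class FormatSpec {
    final Pattern pattern;
    final boolean lower;

    FormatSpec(String regex, boolean lower) {
        this.pattern = Pattern.compile(regex);
        this.lower = lower;
    }

    boolean accepts(String text) {
        return pattern.matcher(lower ? text.toLowerCase() : text).matches();
    }

    public static void main(String[] args) {
        FormatSpec dept = new FormatSpec("^[a-z0-9\\ ]+$", true);
        System.out.println(dept.accepts("Computer Science 42")); // true
        System.out.println(dept.accepts("a&b"));                 // false
    }
}
```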
The <model> part of the task specification defines the template tree. In this definition, two tags can be used: the <label> tag corresponds to a label in the template, the <data> tag corresponds to a text field containing a data value. These tags can be arbitrarily nested in order to create the corresponding hierarchy. Both these tags specify an expected format of a logical structure tree node by using one or more format specifications from the <input> section, for example

<label spec="poslabel|joblabel"/>

means that the node matches a node in the logical document structure that has either the "poslabel" or the "joblabel" format that have been previously specified using regular expressions in the <input> section. The difference between <label> and <data> is that a label is just matched to a structure tree node whereas the data forms part of the extracted output data. Therefore, a data node must have a name attribute set, e.g.

<data spec="posval" name="position"/>

that corresponds to the name of the field in the output data. An example of a complete extraction task specification is in Appendix A.
The module provides a graphical user interface (see the figure 6.6) that allows the user to browse the information extraction task and the tree of the logical structure and to browse the paths in the trees that correspond to each other.
Control panel is a simple application that provides a unified user interface for all the modules. It allows the user to specify the target URL and the extraction task specification file. Then, it calls subsequently the individual modules and finally, it displays the information extraction results.

Figure 6.7: Control panel

The format of the resulting XML file is derived from the extraction task specification:
- The sequence of all the output records is enclosed in a root tag whose name corresponds to the specified extraction task name.

- Each output record is enclosed in a tag whose name corresponds to the name of the extraction model.

- Each record consists of data fields that are enclosed in tags whose names correspond to the specified data field names in the model section of the task specification.

An output document for the example task shown in Appendix A would have the following structure:
<?xml version="1.0" encoding="iso-8859-2"?>
<staff>
<person>
<name>...</name>
<department>...</department>
<email>...</email>
</person>
<person>
...
</person>
...
</staff>
The resulting SQL script consists of a sequence of INSERT commands that store the extracted records to a database table. The commands have the following format:

- The name of the target table corresponds to the name of the extraction model specified using the <model> tag in the extraction task specification.

- The names of the table columns correspond to the specified data field names in the model section of the task specification.

An output script for the example task in Appendix A would have the following format:
INSERT INTO person (name, department, email) VALUES (..., ..., ...);
INSERT INTO person ...
...
We assume that the corresponding database table has been created in advance by the user. In order to create the appropriate SQL commands automatically, it would be necessary to extend the extraction task specification by a possibility of specifying the data types of the individual fields.
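Generating such INSERT commands from an extracted record amounts to joining the field names and the quoted values. A sketch with hypothetical field names; doubling single quotes is the minimal escaping step:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Builds one INSERT command from an extracted record; the table name
// comes from the extraction model, the keys from the data field names.
public class SqlOutput {
    static String insert(String table, Map<String, String> record) {
        String cols = String.join(", ", record.keySet());
        String vals = record.values().stream()
                .map(v -> "'" + v.replace("'", "''") + "'")
                .collect(Collectors.joining(", "));
        return "INSERT INTO " + table + " (" + cols + ") VALUES (" + vals + ");";
    }

    public static void main(String[] args) {
        Map<String, String> rec = new LinkedHashMap<>();
        rec.put("name", "John Smith");
        rec.put("department", "Computer Science");
        rec.put("email", "john@smith.org");
        System.out.println(insert("person", rec)); // prints the complete command
    }
}
```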
Chapter 7

Method Evaluation

In order to evaluate the functionality of the proposed method, we have first tested the proposed information extraction method on selected sets of physical HTML documents available in the World Wide Web. As the next step, we have run the tests on larger logical documents on the same web sites.
7.1 Experiments on Physical Documents

We have tested the method on two different cases. The first experiment was extracting the personal information from various university staff member pages. These pages are usually very strictly formatted and regular. The second experiment deals with stock quote servers. In this case the documents are significantly more complex, contain various kinds of data and their structure exhibits various irregularities.
7.1.1 Experiment 1 - Personal Information

We have tested the proposed method on a simple task that consists of processing the directory pages of various universities and extracting the name, e-mail and department of the persons listed in these pages. We have defined two extraction models that differ in the expected logical structure of the presented information in the page. The two possibilities are shown in the figure 7.1.

Figure 7.1: Two variants of the expected logical structure

The first variant corresponds to the situation where the name of the person is used as a page title or a heading. The second variant corresponds to the situation where the name is presented on the same level as the remaining data fields. By adding the expected field formats and the expected labels as discussed in section 5.6 we obtain a template tree for each variant (we will call these variants A and B) as shown in the figure 7.2. The XML extraction task specification of both variants is included in Appendix A. For simplicity, all the regular expressions except the name value have the case=lower attribute set in the specification so that the text is always converted to lower case before being matched with the regular expression. The format of the name is matched in the case-sensitive manner since the name should start with a capital letter.

 #  URL           Precision %  Recall %  Variant
 1  fit.vutbr.cz       91         100       a
 2  mff.cuni.cz       100         100       a
 3  is.muni.cz        100          52       a
 4  mit.edu             -           -       a
 5  cornell.edu       100         100       a
 6  yale.edu          100         100       a
 7  stanford.edu      100         100       b
 8  harvard.edu       100         100       b
 9  usc.edu             -           -       b
10  psu.edu           100         100       b

Table 7.1: Sample extraction task results
Figure 7.2: The template trees for the variants A and B (the name value format is ^[A-Z][A-Za-z,\.\ ]+$)
As the data source we have used sets of staff personal pages from various universities. Table 7.1 shows the values of precision (3.1) and recall (3.2) as defined in section 3.1.3. From each listed site, we have taken 30 random personal pages. The only input for the information extraction is the URI of the document and the appropriate tree template A or B.
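The precision and recall columns follow the usual information extraction definitions: the ratio of correctly extracted records to all extracted records, and to all records present, respectively. As a quick arithmetic check with hypothetical counts matching row 1 of Table 7.1:

```java
// Precision and recall in percent, with the usual IE definitions:
// precision = correct / extracted, recall = correct / present.
// The counts below are hypothetical, chosen to illustrate a row
// with 91 % precision at 100 % recall.
public class Measures {
    static double precision(int correct, int extracted) {
        return 100.0 * correct / extracted;
    }

    static double recall(int correct, int present) {
        return 100.0 * correct / present;
    }

    public static void main(String[] args) {
        System.out.println(Math.round(precision(30, 33))); // 91
        System.out.println(Math.round(recall(30, 30)));    // 100
    }
}
```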
There are three basic reasons that may cause the extraction process to fail:

- The extracted data is not labeled as expected; e.g. in the testing set 3 the e-mail addresses are not denoted by any label.

- The data to be extracted is contained inside of larger text elements; e.g. the appropriate data at mit.edu (4) is presented as unformatted text only and at usc.edu the labels are not distinguished from the values.

- Various anti-spam techniques used in the documents, such as writing e-mail addresses in some non-standard form or presenting them as an image.
During the tests, the parameters have been set to QSKIP = 0 and MSKIP = 1. Increasing QSKIP and MSKIP can improve the recall in case that the format of the data fields is specific enough; e.g. the e-mail address can be discovered by its format only so that it is not necessary to require any label. Generally, increasing these values causes a significant loss of precision.
7.1.2 Experiment 2 - Stock Quotes

As a second experiment we have extracted some quote data from publicly available quote servers. The template tree with the regular expressions is given in the figure 7.3. The task consisted of extracting the last price, the change of the last price, the opening price, the last bid and the traded volume. Examples of some of the documents are shown in the figure 7.4.

Figure 7.3: Template tree for extracting the quote data
The results are summarized in the table 7.2. The testing sets 2 and 6 both use tables for presenting the data in a quite complicated way so that the logical structure of the table is not detected properly. It would be necessary to further improve the algorithms for the table analysis in order to obtain the proper results. The lower recall for the testing sets 3, 4 and 5 has the usual reason: the data is not labeled as expected by the extraction task specification. The solution is to create more variants of the expected logical structure as discussed in section 5.6.1.
Figure 7.4: Example quote server documents (finance.yahoo.com, quote.com)

 #  URL                   Precision %  Recall %
 1  finance.yahoo.com         100        100
 2  dbc.com                     0          0
 3  quote.com                 100         63
 4  pcquote.com               100         59
 5  money.cnn.com             100         44
 6  uk.finance.yahoo.com        0          0

Table 7.2: Sample extraction task results

7.2 Independence of Physical Realization

In the figure 7.5 we can see a comparison of the web directories of two different universities that correspond to the yale.edu and cornell.edu testing sets. The data from both these sites have been correctly extracted using a single template tree definition although each of them uses a distinct way of data presentation and the HTML codes of both documents are significantly different. Thus, the proposed extraction method is to a great extent independent of the physical realization of the HTML documents.
However, the method still depends on the way of the data presentation; i.e. on the logical structure of the documents. Considering the experiments described in the previous section, this difference causes the necessity of two variants of the template tree in the "personal data" experiment and, when only a single template tree is used, it causes a lower recall of the results, as shown by the results of the second, "stock quote" experiment. The difference between the logical structures of the documents is apparent from a visual comparison of the samples. Let's compare the examples of the variant A of the first experiment that are shown in the figure 7.5 with the example of the variant B that is in the figure 7.6. We can notice that in the former case, the name of the person is presented as being superior to the remaining data (it is used as a title) whereas in the latter case, the name of the person is listed in the same way as the remaining data. The results show that in the proposed method, the independence of the HTML code is partially replaced by the dependence on the logical document structure. However, comparing with the number of various wrappers that would be necessary for extracting the same data from the listed web sites (typically, one for each site), our method brings a significant improvement.
7.3 Information Extraction from Logical Documents

The proposed method for extracting data from logical documents has been developed for processing the logical documents that consist of static web pages. However, in the real World Wide Web, it is not easy to find static logical documents containing large amounts of data. In both cases, the university staff directories and the quote servers, it is impossible to present all the data in a static document due to its great amount. Instead, the data is stored in a database and the documents are generated dynamically upon a user query. For example, the staff directories typically require entering at least the last name of a person and return a document containing the data of all the staff members of that last name. Therefore, the entire directory content cannot be accessed directly through the web interface. At this point, we face the problem of the hidden web as mentioned in section 2.3.
For the mentioned reasons, we have used the following method for obtaining the logical documents from the staff directories in order to evaluate the proposed method.

Figure 7.5: An example of the variant A - Yale University Directory

Figure 7.6: An example of the variant B - Stanford University Directory
We have chosen some frequent last names, namely "Novak", "Dvorak", "Smith" and "Johnson". With the standard web browser we used these names for querying the web directories of the above listed universities. In all the tested directories, the query results in a dynamic document that contains the list of matching staff members, and the searched name is contained in the URI of this document. For example, the list of all the staff members of Stanford University with the name containing "Novak" is available in a document with the following URI:

https://stanfordwho.stanford.edu/lookup?search=Novak&submit=Search
This example shows a URI of a dynamically generated document with two arguments: search and submit. From this point, no user queries are necessary and we can regard this document as the main page of a logical document that contains the data of all the listed persons that match the query. Note that this procedure is not required by the proposed information extraction method as such; it is only a way of obtaining a sufficient amount of logical documents for the method evaluation.
Once we have determined the URI of the main page of the logical document, we can run the information extraction task on this URI. Since this logical document contains the personal pages of all the matching persons and the logical structures of the individual personal pages will appear as subtrees in the logical structure tree of the whole logical document, the results of information extraction should be the same as if the individual personal pages of the listed persons were processed individually. This fact has been confirmed by the practical tests: the information extraction results correspond to the results for the individual physical documents that have been processed during the tests described in section 7.1.1 and that are summarized in the table 7.1. The used logical document discovery algorithm (section 5.7) however tends to include additional pages in the logical document that do not influence the information extraction result, but the download and processing of these excessive pages is time-consuming. For example, in case of the Harvard University, 62 documents have been downloaded during the logical document discovery whereas the data to be extracted was contained in 4 of them.
In case of the quote servers, no summary pages are available that could be used as the main pages of some logical documents. For obtaining the data, the quote symbols must be entered exactly and therefore, the generated documents must be analyzed separately.
Chapter 8

Conclusions

The proposed method of information extraction is usable for real data-intensive documents available through the World Wide Web. By data-intensive documents we mean the documents that are primarily intended for presenting data in a relatively regular and structured form where the visual information plays an important role for the readers' understanding of the document. These documents often contain up-to-date data that are worth extracting and typically, the documents are automatically generated from a back-end database. On the other hand, the method is not suitable for processing the documents where the desired information is buried in large blocks of unformatted or poorly formatted text. For such documents, the traditional text document processing and natural language processing methods are more applicable.
There is one more important issue in information extraction as such that is not of a technical nature. Quite often, the provider of the information presented on the web prefers browsing of the documents by people rather than by automatic tools, mainly for marketing reasons (advertisement etc.). Such subjects often use various techniques that complicate automatic processing of the documents, such as detection and blocking of the client or "hiding" the information in the document. There are, however, still many areas where the information extraction is justifiable and useful.
8.1 Summary of Contributions

The proposed method presents a novel approach to information extraction from HTML documents. In contrast to the current methods based mainly on the direct analysis of the HTML code, our method has the following important features:

- Independence of the underlying HTML code of the document. The document is described by an abstract model, which is then used for extracting information. This abstraction avoids the dependence on particular HTML tags, which is the bottleneck of the wrapper approach.

- Resistance to the changes of documents. The use of the abstract model ensures that the method is resistant to changes in the data presentation in the document unless the logical structure of the document changes.

- No training phase required. The information extraction process can start as soon as the extraction task specification is finished. There is no training set of example documents needed. The method allows processing new, previously unknown documents that correspond to the extraction task specification.
Aside from this main contribution, there are some aspects of the method proposal that we consider significant or novel contributions:
Formal models of the visual information in the document. To the author's best knowledge, this is the first attempt to formally describe the information that is conveyed to the user by visual means. We propose formal models of two components of this information: the page layout and the visual attributes of the text.
Modeling the logical structure on the basis of the visual model. This approach is unique in the processing of HTML documents, although a similar idea has been proposed for other types of documents. The model of the logical structure provides important information about the document that can be used (aside from information extraction) in many other areas, such as information retrieval (searching documents based on structured queries instead of single keywords) or alternative document presentation (e.g. structure-aware voice readers for blind people).
Application of tree matching algorithms. Although the hierarchical organization of HTML documents is commonly accepted, the use of tree matching algorithms for this task can be considered a novel contribution to this area.
Finally, a contribution of the thesis is an experimental information extraction system that implements the proposed techniques. This system has been implemented in the Java environment and has been used for verifying the method in the real world.
8.2 Possible Improvements and Future Work
From the point of view of further improvements of the method, the most important point seems to be the proposed algorithm for analyzing the logical structure of HTML tables. Although this algorithm works satisfactorily for most documents, there exist more complex ways of presenting data in a table that are not recognized properly, as shown, for example, by the test results given in Section 7.1.2. To be able to process a larger set of possible variants, a more sophisticated analysis method should be developed.
The second issue is the way the logical document structure is used for information extraction. In our method, we use tree matching algorithms and avoid any machine learning phase. However, for some applications, it would be interesting to use machine learning algorithms for inferring the extraction task specification automatically.
Appendix A
</label>
</data>
</model>
</task>
Variant B
<spec case="lower" name="int">[1-9][0-9,]+</spec>
<spec case="lower" name="na">N/A</spec>
</task>
Appendix B
<!-- Specification of a named field format. The #PCDATA contains
     a regular expression. The comparison may be either case
     sensitive or lowercase. -->
<!ELEMENT spec (#PCDATA)>
<!ATTLIST spec
    name CDATA #REQUIRED
    case (lower|sensitive) "lower">
<!-- Template tree specification -->
<!-- A hierarchy of labels (not included in the output) and data
     (included in the output). The spec attribute must correspond
     to a name of any <spec> element above; logical disjunction
     can be denoted by "name1|name2". -->
<!ELEMENT model (label*, data*)>
<!ATTLIST model
    name CDATA #REQUIRED>
The format of this file is described in detail in Section 6.2.2 on page 58.
<?xml version="1.0" encoding="UTF-8"?>