
Brno University of Technology
Faculty of Information Technology

Information Extraction from HTML Documents Based on Logical Document Structure

by
Radek Burget
M.Sc., Brno University of Technology, 2001

a thesis submitted in partial fulfillment
of the requirements for the degree of
doctor of philosophy

Supervisor: Jaroslav Zendulka, associate professor

Submitted on: August 27, 2004
State doctoral exam passed on: June 18, 2003
This thesis is available at the library of the Faculty of
Information Technology of the Brno University of Technology
2004 Radek Burget
All rights reserved. This work may not be
reproduced in whole or in part, by photocopy
or other means, without the permission of the author.
Contents

Abstract
Preface
1 Introduction
  1.1 Information Extraction
2 The World Wide Web Technology
  2.1 Hypertext Markup Language
  2.2 Cascading Style Sheets
  2.3 Dynamic Web Content
  2.4 Web Services
  2.5 Semantic Web
  2.6 Current Information Extraction Alternatives
3 State of the Art
  3.1 Information Extraction from HTML Documents
    3.1.1 Wrappers
    3.1.2 Document Code Modeling
    3.1.3 Wrapper Induction Approaches
    3.1.4 Alternative Wrapper Construction Approaches
    3.1.5 Computer-aided Manual Wrapper Construction
  3.2 Cascading Style Sheets from the IE Perspective
  3.3 Advanced Document Modeling
    3.3.1 Logical versus Physical Documents
    3.3.2 Logical Document Discovery
    3.3.3 Logical Structure of Documents
    3.3.4 Visual Analysis of HTML Documents
4 Motivation and Goals of the Thesis
5 Visual Information Modeling Approach to Information Extraction
  5.1 Proposed Approach Overview
  5.2 Visual Information Modeling
    5.2.1 Modeling the Page Layout
    5.2.2 Representing the Text Features
  5.3 Representing the Hypertext Links
  5.4 HTML Code Analysis for Creating the Models
    5.4.1 Tables in HTML
    5.4.2 Example of Visual Models
  5.5 Logical Structure of a Document
  5.6 Information Extraction from the Logical Model
    5.6.1 Using Tree Matching
    5.6.2 Approximate Unordered Tree Matching Algorithm
  5.7 Information Extraction from Logical Documents
6 Experimental System Implementation
  6.1 System Architecture Overview
  6.2 Using XML for Module Communication
    6.2.1 Representing the Logical Documents
    6.2.2 Logical Structure Representation
  6.3 Implementation
    6.3.1 Interface Module
    6.3.2 Logical Document Module
    6.3.3 Analysis Module
    6.3.4 Extraction Module
    6.3.5 Control Panel
  6.4 Information Extraction Output
    6.4.1 Extracted Data as an XML Document
    6.4.2 Extracted Data as an SQL Script
7 Method Evaluation
  7.1 Experiments on Physical Documents
    7.1.1 Experiment 1 - Personal Information
    7.1.2 Experiment 2 - Stock Quotes
  7.2 Independence of Physical Realization
  7.3 Information Extraction from Logical Documents
8 Conclusions
  8.1 Summary of Contributions
  8.2 Possible Improvements and Future Work
Bibliography
A Example Task Specification
B Document Type Definitions
  B.1 Task Specification
  B.2 Logical Document Representation
  B.3 Logical Structure Representation
Abstract
The World Wide Web is the largest Internet source of information from a broad range of areas. Web documents are mostly written in the Hypertext Markup Language (HTML), which does not contain any means for the semantic description of the content, and thus the contained information cannot be processed directly. Current approaches to information extraction from HTML are mostly based on wrappers that identify the desired data in the document according to some previously specified properties of the HTML code. The wrappers are limited to a narrow set of documents and they are very sensitive to any changes in the document formatting.
In this thesis, we propose a novel approach to information extraction that is based on modeling the visual appearance of the document. We show that there exist some general rules for the visual presentation of data in documents and we define formal models of the visual information contained in a document. Furthermore, we propose a way of modeling the logical structure of an HTML document based on the visual information. Finally, we propose methods for using the logical structure model for the information extraction task, based on tree matching algorithms. The advantage of this approach is a certain independence of the underlying HTML code and a better resistance to changes in the documents.
Preface
In 2001, when I finished my master studies at the Faculty of Electrical Engineering and Computer Science at the Brno University of Technology, I was thinking about enrolling in the PhD program. There were multiple topics available, but one of them attracted me more than the others: "Methods of knowledge discovery in the WWW". Since I was fascinated by everything related to the web, I started working on this topic under the supervision of Jaroslav Zendulka. Very quickly I realized how broad this area is, and I decided to focus on processing the web content, which is always the first step of the data mining process.
During the first year, I tried to orient myself in the topic. In addition to reading great amounts of available papers, I attended the EDBT 2002 Summer School on distributed databases on the Internet. In 2002, I also spent three months on a study stay at the University of Valladolid in Spain, which was also very valuable for me.
This thesis is the result of the last three years' work. In my first papers concerning this topic [8, 9], I formulated the features of current information extraction approaches that, in my opinion, caused the major problems of the existing methods, and I proposed a more abstract look at HTML documents. During further work, I proposed some suitable models of the documents and methods of using them for information extraction [10]. The last year I spent on a formal specification of the proposed models and methods [11].
Organization of the Thesis
This thesis is organized as follows. Chapter 1 contains a short introduction to the area of processing the data accessible through the World Wide Web and explains basic concepts of information extraction from web documents.
Chapter 2 gives a brief overview of the current World Wide Web technology. Basic concepts of the most important languages are explained and the main technological aspects and forthcoming technologies are discussed.
In Chapter 3, the state of the art in information extraction from HTML documents and related areas is summarized.
In Chapter 4, we summarize the major problems of current information extraction approaches as the motivation for this thesis, and we formulate the goals of the thesis.
Chapter 5 is the theoretical core of the thesis: it contains the formal models of the visual appearance and the logical structure of HTML documents and introduces a novel information extraction method based on these models.
In Chapter 6, we describe an experimental system that implements the proposed information extraction method and that has been used for testing the method on real data.
Chapter 7 summarizes the experimental results.
Finally, in Chapter 8, we conclude the thesis. We discuss possible uses of information extraction and of logical document structure modeling. We summarize the major contributions of the thesis and we propose possible improvements of the method and directions for further investigation in this area.
The appendices contain the XML specifications of the information extraction tasks that have been used for testing the method, and formal definitions of the XML formats used within the experimental system.
Acknowledgements
I would like to express my sincere thanks to the people who have helped me with writing this thesis: Jaroslav Zendulka, my supervisor, for his valuable comments and organizational support, and Alexander Meduna for his great help with the formal specification issues. Many thanks also go to the people who supported me while I was writing this thesis: my parents, my brother and my girlfriend Jana.
Chapter 1

Introduction
The World Wide Web is currently the largest Internet source of information from a broad range of areas. One of the main reasons for this great expansion is the absence of any strict rules for information presentation and the relative simplicity of the technology used. The orientation to documents, and the HTML language that is mainly used for creating the documents, give the authors enough freedom for presenting any kind of data with minimal effort, and newer technologies such as Cascading Style Sheets (CSS) allow them to achieve the desired quality of presentation. Together with the hypertext nature of the documents, these properties make the World Wide Web a distributed and dynamic source of information.
On the other hand, the loose form of data presentation also brings some drawbacks. With the increasing number of available documents, the problem arises of how to efficiently access and utilize all the data they contain. Due to the above properties of the web, related information is often presented at different web sites and in diverse forms. Accessing this data by browsing the documents manually is a time-consuming and complicated task. For this reason, it is desirable to process the documents automatically by computer. As a first step, there are efforts to provide users with centralized views of related data from various sources in the World Wide Web, such as the services for comparing the prices of goods in on-line shops. The next step is presented by the data mining techniques that have been developed in the database field.
For all these tasks, it is necessary to access the data contained in the documents. This is, however, not a trivial problem. As mentioned above, the state-of-the-art web consists mainly of documents written in the Hypertext Markup Language (HTML) [54]. This language is suitable for defining the presentation aspect of the documents, but it lacks any means for defining the content semantics. The information contained in the documents can therefore hardly be interpreted and processed by a computer.
A possible answer to this problem is the Semantic Web proposal [6], which is based on different technology and has a quite different nature. While the classical web can be viewed as a distributed document repository, the Semantic Web has many characteristics of an object database [34]. Although this technology is very promising and is being developed rapidly, it does not solve the processing of the great number of documents that are already available on the "legacy" web. Moreover, simultaneously with the Semantic Web, the legacy web is still growing and developing. In contrast to the Semantic Web, however, the evolution of its technologies has a quite different direction. Currently, the most important issue in the development of classical web technologies is the flexibility of web design and the effective management of web content. For the information providers, it is not the aim to allow better automatic processing of the documents on the web, and in some cases it is even undesirable. All these facts, together with the great amount of information that is virtually available, make automatic HTML document processing an interesting and challenging area of investigation.

Figure 1.1: An Information Extraction Task
1.1 Information Extraction
Our aim in HTML document processing is to identify particular information that is explicitly contained in the text of the document and to store it in a structured form, e.g. in a database table or an XML document. This process is usually called information extraction from HTML documents [21, 26, 38].
As an example, let us imagine a set of pages containing information about various countries, as shown in Figure 1.1. The task of information extraction in this case is to identify the values of the country name, area, population, etc. and to store them in a structured way. The result of processing such a set of documents could be a database table containing the appropriate values for each country.
There are two basic approaches to information extraction. The first and most common one assumes that it is specified in advance what data are to be extracted from the documents. The specification for the country information task mentioned above can have, for example, the following format:
COUNTRY ::= <name, area, population, capital>

i.e. for ea h ountry we want to extra t its name, geographi al area, population and the
name of the apital. This type of task is often alled the slot lling task. In ontrary,
the other approa h is to analyze the do uments and extra t all the available data.
This approa h requires a set of do uments to be ompared in order to distinguish the
relevant data from the remaining text [18℄. It is assumed that the relevant data di ers
among the do uments whereas the other text remains stati .
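The slot filling specification above can be represented directly as a small data structure. The following is a minimal sketch in Python; the slot names come from the COUNTRY example, while the helper names and sample values are illustrative assumptions, not part of the thesis:

```python
from dataclasses import dataclass, field

@dataclass
class SlotFillingTask:
    """A slot filling task: a record type and the ordered slots to fill."""
    record_name: str
    slots: list = field(default_factory=list)

# The COUNTRY task from the text above.
country_task = SlotFillingTask("COUNTRY",
                               ["name", "area", "population", "capital"])

def fill_record(task, values):
    """Pair the slot names with extracted values, keeping the task order."""
    return dict(zip(task.slots, values))

# Processing a whole set of documents yields one such record per country,
# i.e. the rows of the resulting database table.
row = fill_record(country_task, ["Norway", "385207", "5421241", "Oslo"])
```

The record dictionaries map one-to-one onto columns of a database table or elements of an XML document, the two storage forms mentioned above.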
The process of extracting information from the World Wide Web can be split into the following phases:
1. Localization of relevant documents
2. Identification of the data in the documents
3. Data storage
The first is the information retrieval phase, where the documents containing the appropriate data must be localized on the World Wide Web. Since each document is identified by its URI, the result of this phase is a list of URIs of the documents to be processed. On the World Wide Web, several types of services are available for locating documents, from web directories that provide categorized lists of URLs to automatic search engines such as Google¹ that are based on indexing the terms in the documents. The hypertext nature of HTML documents brings another problem: the presented data can be split across several HTML documents that contain links to each other. Such a set of HTML documents then forms a logical entity that is usually called a logical document [60]. Since the way in which the information is split into individual documents cannot be predicted, all the physical documents must be processed in order to extract the information. As a result, the URIs of all the documents must be discovered.
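Since the physical documents of a logical document are connected by hypertext links, discovering their URIs amounts to collecting and resolving the link targets found in each page. A minimal sketch using Python's standard html.parser (the example page and URIs are illustrative assumptions; a real crawler would also need a policy for which links belong to the same logical document):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the targets of <a href="..."> links, resolved against a base URI."""
    def __init__(self, base_uri):
        super().__init__()
        self.base_uri = base_uri
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Relative links are resolved against the document's URI.
                    self.links.append(urljoin(self.base_uri, value))

html = ('<p>See <a href="page2.html">next part</a> and '
        '<a href="/index.html">home</a>.</p>')
collector = LinkCollector("http://www.example.org/doc/page1.html")
collector.feed(html)
# collector.links now holds candidate URIs of further physical documents.
```

Running the collector over each newly discovered page until no new URIs appear yields the full set of physical documents to be processed.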
The identification of the data in the documents is the key phase of the information extraction process. During this phase, a model of the logical document is created and the desired data values are identified by certain extraction rules. The character of these rules depends on the model of the document and on the information extraction method. Alternatively, the data values may be identified based on a statistical model of the document (e.g. Hidden Markov Models) instead of explicit extraction rules.
The last phase of the process is data storage. This phase depends completely on the application.
Although the information extraction area was established long before the World Wide Web came into being, and its application to the web is almost as old as the web itself, due to the enormous extent and variability of the web it still presents a challenging problem that has not been satisfactorily resolved yet. In this thesis, we focus on the parts of the information extraction process that are not yet sufficiently explored. In the first phase, it is the problem of logical document discovery. We discuss the existing approaches and propose certain improvements to current techniques. The main focus of our work is on the data identification phase. We analyze the problems of current methods and propose a novel approach to this problem based on modeling the visual aspect of the documents and their logical structure.

¹ http://www.google.com
Chapter 2

The World Wide Web Technology
The World Wide Web is built on the concept of hypertext. This notion was introduced in 1965 by T. H. Nelson [51] and refers to a text that is not constrained to be linear; it contains links to other texts. Hypertext that is not constrained to text only is called hypermedia. It can include graphics, video and sound, for example.
The World Wide Web is document-oriented. Each document is identified by a Uniform Resource Identifier (URI). The notion of a Uniform Resource Locator (URL) is often used with the same meaning. A document can contain any type of data (text, hypertext, graphics, etc.). For identifying the type of content, the MIME (Multipurpose Internet Mail Extension) standard is used.
The web is based on a client-server architecture. The documents are stored on a web server that is accessible through the Internet. The client accesses the server and downloads the desired documents using the Hypertext Transfer Protocol (HTTP) [24]. In most cases, the client is a web browser that displays the document on the screen.
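The same client-server exchange can be performed programmatically. A minimal sketch with Python's standard urllib; the URL is a placeholder assumption, and the actual network call is shown but commented out:

```python
from urllib.request import Request, urlopen

# An HTTP request for one document, as a browser would issue it.
req = Request("http://www.example.org/index.html",
              headers={"Accept": "text/html"})

# urlopen() would perform the HTTP exchange; the MIME standard mentioned
# above is what the server's Content-Type response header carries back,
# e.g. "text/html; charset=iso-8859-1".
# response = urlopen(req)                       # requires network access
# mime_type = response.headers.get_content_type()

print(req.full_url, req.get_header("Accept"))
```

A document processing tool plays exactly this client role, except that the downloaded HTML is parsed rather than rendered on the screen.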
Most of the documents in the World Wide Web are created using the Hypertext Markup Language (HTML). Additionally, further technologies have been developed to improve the presentation capabilities and to make the maintenance of the content more efficient. There also exist many standard formats for presenting non-textual data such as images and multimedia files.
2.1 Hypertext Markup Language
Hypertext Markup Language (HTML) [54] is the basic language for publishing hypertext on the web. It is based upon SGML (Standard Generalized Markup Language, ISO 8879). An HTML document is basically a text file enriched by standardized tags. HTML contains tags for defining the structure of the document (headings, paragraphs), basic page objects (lists, tables, images), text formatting (font and color selection, special effects) and, of course, hypertext links to other documents. Most of the tags are pair tags that consist of an opening tag written <tagname> and a closing tag written </tagname>. For example, a portion of text can be written in bold with the following code:
Normal text <b>bold text</b> normal text.
that will be printed out as:
Normal text bold text normal text.
Some of the tags are unpaired; for example, a line break can be inserted using just a <br> tag. Some tags allow or require additional attributes that define the meaning of the tag more precisely. For example, when inserting an image into the document, an attribute src must be specified that identifies the file where the image is stored: <img src="image.jpg">.
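The tag structure just described is what any HTML processing tool sees first. A minimal sketch using Python's standard html.parser, fed with the examples from this section, which reports opening tags (pair and unpaired, with their attributes), closing tags and the text between them:

```python
from html.parser import HTMLParser

class TagPrinter(HTMLParser):
    """Record opening tags (with attributes), closing tags and text runs."""
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("open", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("close", tag))

    def handle_data(self, data):
        self.events.append(("text", data))

p = TagPrinter()
p.feed('Normal text <b>bold text</b> normal text.'
       '<br><img src="image.jpg">')
# Pair tags produce an open/close event pair; the unpaired <br> and <img>
# produce only an open event, with attributes such as src available.
```

Such an event stream (or the tree built from it) is the usual starting point for the document models discussed in the following chapters.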
The SGML language that was used as a basis for creating HTML is very powerful and flexible. However, this great flexibility brings certain drawbacks such as, for example, a complicated implementation. For this reason, a simplified derivative of SGML has been created, called XML (eXtensible Markup Language). The most important simplification is that only pair tags are allowed and the tags must be properly nested (they may not overlap). For hypertext publishing, an XML variant of HTML has been created, called XHTML (eXtensible Hypertext Markup Language). In comparison to HTML, many tags have been omitted and replaced by other means (mainly the Cascading Style Sheets described below). The unpaired tags have been replaced by pair tags even in cases where a pair tag makes no sense; e.g. the above-mentioned line break must be written as <br></br>, which can, according to the XML specification, also be written as <br/>. Currently, XHTML is the upcoming standard in web publishing. All the mentioned web standards are maintained by the World Wide Web Consortium¹.
In this thesis, we focus on hypertext documents written in the HTML language, optionally in conjunction with Cascading Style Sheets. However, all the techniques and methods proposed below apply to the XHTML language as well.

¹ http://www.w3.org
2.2 Cascading Style Sheets
Cascading Style Sheets (CSS) is a simple mechanism for adding style (e.g. fonts, colors, spacing) to web documents. It allows separating the visual style of the page from the HTML code, which facilitates the creation and management of web pages and allows better flexibility. For XHTML, the Cascading Style Sheets present the sole mechanism for specifying the visual style of the page.
The style sheets consist of a set of rules that define the style for particular (X)HTML tags. For example, a rule can look as follows:
h1 { font-weight: bold; color: blue; }
This rule says that the headings marked with the <h1> tag will be printed in bold and blue color. It consists of a selector that determines to which tags the rule should be applied, and a declaration. The declaration consists of the names of CSS properties (in our case font-weight: and color:) and their values (bold and blue). The values of the individual properties are inherited as the tags are nested in the HTML code; the values of the properties that are not specified in our example are inherited, for example, from the declaration for the <body> tag that encloses the whole content of an HTML document.
Each element in the HTML code denoted by some tag can be assigned a named class using the class attribute (e.g. <p class="heading">) and/or assigned an identifier using the id attribute that is unique within the whole document. The CSS rule selectors can be based not only on the tag names (as in the example) but also on the element classes and identifiers.
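The inheritance of property values along the tag nesting can be sketched as folding rule declarations over the path from the outermost element to the one in question. The following is a deliberately simplified model for illustration only (real CSS also involves selector specificity, the cascade, and properties that are not inherited):

```python
# Simplified CSS inheritance: a nested element starts with the computed
# properties of its enclosing tags and overrides those its own rule sets.
rules = {
    "body": {"font-family": "serif", "color": "black"},
    "h1":   {"font-weight": "bold", "color": "blue"},
}

def computed_style(nesting_path):
    """Fold rule declarations over a path of nested tags, outermost first."""
    style = {}
    for tag in nesting_path:
        style.update(rules.get(tag, {}))
    return style

# An <h1> nested in <body>: it inherits font-family from the <body>
# declaration and overrides color with its own rule.
style = computed_style(["body", "h1"])
```

This is the mechanism by which the unspecified properties of the <h1> example above receive their values from the enclosing <body> declaration.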
The style sheets are incorporated into the HTML code in one of the following ways:
- The style sheet is placed in a separate file that is available through the web and is referenced in the HTML document header.
- The style sheet is inserted directly into the header of the HTML document.
- The declaration is specified without a selector, directly for a particular tag in the document, as the value of its style attribute. This is called an inline style definition. For example, the tag <p style="font-weight: bold;"> starts a paragraph written in bold.
The use of CSS has grown significantly in the last few years, which has been enabled by sufficient support for CSS in modern web browsers. It is now practically unimaginable to create a modern, well-designed web presentation without the use of CSS.
2.3 Dynamic Web Content
Modern web technologies allow creating documents that do not contain only static text; their contents can be generated dynamically upon a client request, according to various factors. Basically, the dynamic content can be generated in two ways:
- On the server side. There are various technologies that allow generating the HTML code at the time of the client request (e.g. CGI, PHP, ASP, JSP, etc.). In this case, no support for dynamic content is required on the client side; the client always obtains a complete HTML document that can, however, be different on each request.
- In the client browser. The HTML document that is sent to the client contains routines written in a scripting language (in most cases, the JavaScript language) that are interpreted by the client browser. The scripting language must be supported by the browser.
From the document processing point of view, the former case does not present any particular problem because the client always obtains a complete document that can be processed directly. The use of JavaScript or a similar technology presents a serious problem not only for automatic web content processing tools but also for various alternative browsers such as text-oriented browsers or voice readers for blind people. For this reason, JavaScript should be used as an additional feature only, and its support should not be automatically expected by the authors of documents. For the same reason, we do not address the problems related to dynamic web content generated on the client side in this thesis.
An additional problem of pages generated dynamically on the server side is that the generation procedure requires certain data to be supplied by the client when requesting the page from the server. Typically, this is the case of pages that are generated according to the values filled in to an interactive form available on another page. Without filling in some fields of the form, i.e. without providing particular data along with the page request, the dynamic page is not accessible. The great number of such pages available through the web is often called the hidden web, because its content is not easily accessible to any automatic web content processing tools, including the search engine indexing robots. Although the hidden web presents a considerable problem for any automatic processing of web content, it forms a separate and quite large area of investigation that is quite distant from information extraction as such. Therefore, in this thesis, we only mention this problem without discussing it in greater detail.
2.4 Web Services
Web services present a way of using the Internet for application-to-application communication. They provide a standardized way of specifying the capabilities and programmatic interfaces of services that are available over the Internet, and the communication protocols that allow the services to be used by other services or applications. These possibilities give rise to a distributed service architecture where the services can use each other for performing a particular task, and the inter-service communication and the results have a standardized, computer-processable format.
For example, the Google search engine provides a standard web interface where the user uses a web browser for downloading an HTML page with a query form; he fills in the search query and posts the query to the server. The server returns the list of search results, again as an HTML document. The results in this form are intended to be displayed only; it would be very complicated to further process the query results in some application. For this reason, Google provides an alternative interface in the form of a web service. This service receives the query and returns the appropriate results using the standardized web service protocols based on XML, so that the results may be processed further.
The roles in web services and the relations among the protocols and the roles are shown in Figure 2.1. The basic roles are the service provider and the service requester (client). The requester sends a request and the provider sends back the results, both using the Simple Object Access Protocol (SOAP). In order to allow the automatic localization of appropriate web services over the Internet, the web service proposal includes the role of a service broker that maintains a registry of available web services. Each service publishes information about its purpose and interface to the service broker using the Web Services Description Language (WSDL). The service registry format and the way of querying it are described by the UDDI (Universal Description, Discovery, and Integration) standard.
[Figure: the Service Broker at the top; the Service Provider publishes to the broker via WSDL, the Service Requester finds services via UDDI, and provider and requester bind to each other via SOAP]
Figure 2.1: The relations among protocols and roles in web services
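The requester's side of the SOAP exchange described above can be sketched by building the request envelope with Python's standard XML tools. The envelope namespace is the SOAP 1.1 one; the operation name search and its query parameter are illustrative assumptions modeled on the Google example, not an actual service interface:

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"

def soap_request(operation, params):
    """Build a minimal SOAP envelope carrying one operation call."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, operation)
    for name, value in params.items():
        ET.SubElement(call, name).text = value
    return ET.tostring(envelope, encoding="unicode")

# A search request as the requester role would send it to the provider;
# the provider's reply arrives in the same enveloped XML form.
request = soap_request("search", {"query": "information extraction"})
```

The point of the format is exactly the one made in the text: unlike an HTML result page, this XML is directly processable by the receiving application.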

2.5 Semantic Web
According to [6], the Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation. The Semantic Web proposal is based on two basic technologies: XML and the Resource Description Framework (RDF). RDF allows encoding basic triplets, being rather like the subject, verb and object of an elementary sentence. Slightly more formally, a triplet consists of a subject, a predicate (also known as the property of the triplet) and an object, each of which is defined by its URI. Alternatively, the object may be specified by a literal. This simple architecture allows anyone to define new concepts and verbs by defining a new URI for each, and to "say anything about anything" by defining new RDF triplets. Each set of RDF definitions creates an RDF graph, where the nodes correspond to the concepts (subjects and objects) and the edges correspond to the predicates. Figure 2.2 shows an example of an RDF graph that specifies the address of a person. The ovals correspond to the concepts defined by their URIs (a staff member and an address); the rectangles represent the literals.
The last important component of the Semantic Web is formed by ontologies. Basically, an ontology is a document or file that formally defines the relations among terms. For specifying ontologies in the Semantic Web environment, the Web Ontology Language (OWL) has been proposed.

[Figure: a staff-member resource (http://www.example.org/staffid/85740) linked by an address property (http://www.example.org/terms/address) to an address resource (http://www.example.org/addressid/85740), whose street, city, state and zip properties point to the literals "1501 Grant Avenue", "Bedford", "Massachusetts" and "01730"]
Figure 2.2: An example of an RDF graph (taken from [44])
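The triplet structure described above can be sketched directly as (subject, predicate, object) tuples, here populated with the URIs and literals from Figure 2.2 (the lookup helper is an illustrative sketch, not an RDF API):

```python
# RDF triples as (subject, predicate, object); objects are URIs or literals.
EX = "http://www.example.org"
triples = [
    (f"{EX}/staffid/85740", f"{EX}/terms/address", f"{EX}/addressid/85740"),
    (f"{EX}/addressid/85740", f"{EX}/terms/street", "1501 Grant Avenue"),
    (f"{EX}/addressid/85740", f"{EX}/terms/city", "Bedford"),
    (f"{EX}/addressid/85740", f"{EX}/terms/state", "Massachusetts"),
    (f"{EX}/addressid/85740", f"{EX}/terms/zip", "01730"),
]

def objects_of(subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

# Traverse the graph: follow the address edge of the staff member
# (a URI-valued object), then read one of the literal-valued properties.
address = objects_of(f"{EX}/staffid/85740", f"{EX}/terms/address")[0]
city = objects_of(address, f"{EX}/terms/city")[0]
```

The subjects and objects are the nodes of the RDF graph and the predicates are its edges, exactly as in the figure.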

2.6 Current Information Extraction Alternatives
As stated in the introduction, it is not a trivial task to process the information contained in HTML documents, mainly because the documents do not contain any formal definition of the semantics of the contained data. Web services and the Semantic Web are often regarded as possible alternatives that can be used for giving the content semantics and allowing better processing of the content.
Web services are suitable for applications where the resulting data are highly structured and have a fixed format that is specified in the service definition. The resulting data are expected to be the product of some process that can optionally be influenced by some input parameters. The drawback of web services is that they require the installation and maintenance of a considerable amount of quite complicated software on the server, which can, in addition, significantly increase the demands on server performance. Creating a web service is not a trivial task either. It requires special tools and creating the software pieces that implement the web service process itself. When comparing this quite complicated technology with the simplicity of the traditional web, in most cases the possible benefits are not worth the effort.
The Semantic Web, in comparison to web services, is much more flexible, and its implementation on the server side consists mainly of creating the appropriate XML documents that contain the RDF data. On the other hand, in order to be usable, it requires the implementation of software agents that would join together the distributed pieces of information in order to infer the desired results. Unfortunately, the entire Semantic Web is still in a more or less theoretical stage.
The last important issue is not technological. Very often, information is presented on the web with the only purpose of attracting users who read the advertisements and bring profit to the service providers. It is not the aim of the information providers to allow better automatic processing of the presented data.
Considering all these facts, the service providers are not motivated enough to adopt the existing or proposed semantic technologies, and even the benefits of these technologies for the end users are questionable. The result is obvious: while the number of available HTML documents grows exponentially, the semantic technologies are practically not used. In this situation, we must accept the current state of the web as it is and state that extracting information from the unsophisticated HTML documents is currently the only way to access the information available on the web.

Chapter 3

State of the Art


Our approach to information extraction combines several areas of endeavour that have been intensively investigated recently. Firstly, there is the area of information extraction from HTML documents, which mainly focuses on the problems related to wrapper generation and wrapper induction. Secondly, there is the area of advanced modeling of the structure of documents and the discovery of semantic structures in documents.
As mentioned in section 1.1, the hypertext capability of HTML allows creating larger information entities consisting of multiple documents linked together, which are usually called logical documents. A comprehensive approach to information extraction therefore includes the analysis of logical documents.
First, we give an overview of current approaches to individual, "physical" HTML documents. Next, we discuss the area of modeling the logical organization of documents. This area presents an alternative view of HTML document modeling that is applicable to information extraction. Finally, current techniques of logical document discovery are discussed.
3.1 Information Extraction from HTML Documents
Information extraction was largely investigated in the plain text context long before the World Wide Web emerged. It has been used for processing electronic mail messages, network news archives, etc. [50]. Many information extraction techniques for various types of electronic messages have been proposed within the frame of the Message Understanding Conferences (MUC) [31] that were held from 1987 to 1998. For each MUC, participating groups were given sample messages and instructions on the type of information to be extracted, and developed a system to process such messages. These systems were then evaluated for the conference.
The World Wide Web and the HTML language bring a new perspective to information extraction. In contrast to plain text messages, HTML allows defining the visual presentation of the content. This possibility is often used for making documents clearer and more easily understandable. The data in HTML documents is presented in a more or less regular and structured fashion. For this reason, HTML documents are often regarded as a semistructured information resource [38]. The reader is not forced

    Capital Cities              ...
                                <h1>Capital Cities</h1>
    France - Paris              <b>France</b> - <i>Paris</i>
    Japan - Tokyo               <b>Japan</b> - <i>Tokyo</i>
                                ...

Figure 3.1: Example of a simple document and its code


to read the whole document. On the contrary, according to many studies, readers only scan the document looking for interesting parts instead of reading it word by word [47]. Due to this behaviour of web users, writing text for the World Wide Web has become a specific discipline for which the term web copywriting is frequently used. One of the major requirements on text for the web is that the organization of the page is clear and the user can easily find the part of the document containing the desired information. For this purpose, the documents contain a system of navigational cues that have a mostly visual character. During the years of World Wide Web development, these techniques of data presentation have been brought almost to perfection, so that reading documents using a web browser has become relatively efficient.
From the point of view of automated document processing, the situation is different. HTML doesn't contain sufficient means for a machine-readable semantic description of the document content. The techniques for natural language processing that have been used for information extraction from plain text are not applicable, because HTML documents usually don't contain many whole sentences or blocks of continuous text. On the other hand, the HTML tags inserted into the text of the document provide additional information that can be used for identifying the data. Figure 3.1 shows an example of a simple document and the relevant part of its HTML code. Let's imagine that we want to extract the names of countries and their capitals from this document. Looking at the HTML code, we can notice that each name of a country is surrounded by the <b> and </b> tags and, accordingly, each name of a capital is surrounded by the <i> and </i> tags. Thus, for extracting the desired information from this document, a simple procedure can be created that reads the document code, detects these tags and stores the text between each pair of tags. Such a procedure is called a wrapper. Apparently, the given example is quite trivial. For more complex documents, more sophisticated wrappers have to be designed.
For the data in HTML documents, database terminology is usually used in the information extraction context. We assume that the documents contain one or more data records where each record consists of some number of data fields. Usually, we admit that some records are incomplete, i.e. that the values of some fields are missing in the document. For example, figure 3.1 shows a document containing two records where each record has two fields: the name of the country and the name of its capital city.

    HTML documents  --->  Wrapper  --->  Extracted data
                             ^
                     extraction rules

Figure 3.2: Information extraction using a wrapper

    Capital Cities              ...
                                <i>Capital Cities</i>
    France - Paris              <b>France</b> - <i>Paris</i>
    Japan - Tokyo               <b>Japan</b> - <i>Tokyo</i>
                                ...

Figure 3.3: Ambiguous fields in a document


3.1.1 Wrappers

According to Kushmerick [38], a wrapper is a procedure that provides the extraction of particular data from an HTML document, as illustrated in figure 3.2. For the identification of the particular data in the document, the wrapper uses either a set of extraction rules that define how each individual data field is identified, or a model of the document that is used for deciding which part of the document corresponds to a particular data value (for example, Hidden Markov Models are used in some methods). By wrapper construction, we mean the process of formulating the extraction rules or the document model for a particular information extraction task.
Kushmerick [38] defines six classes of wrappers with increasing expressiveness that differ in the way the extraction rules are defined. The simplest wrapper class is called LR (left-right). In this class, one extraction rule is defined for each data field to be extracted. Each rule is a pair of strings that delimit the field in the document code from the left and from the right. Let's consider again the example of a simple document shown in figure 3.1. An LR wrapper for this task can be defined as a set of two rules:

    W = {[<b>, </b>], [<i>, </i>]}

where the first rule identifies the values of country and the second one identifies the values of capital.
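The LR class can be sketched generically; the rule representation below is our own illustration, not Kushmerick's formal notation:

```python
def lr_extract(html, rules):
    """Apply LR rules: each rule is a (left, right) delimiter pair."""
    fields = []
    for left, right in rules:
        values, pos = [], 0
        while True:
            start = html.find(left, pos)
            if start == -1:
                break
            start += len(left)
            end = html.find(right, start)
            if end == -1:
                break
            values.append(html[start:end])
            pos = end + len(right)
        fields.append(values)
    # pair the i-th value of every field into the i-th record
    return list(zip(*fields))

W = [("<b>", "</b>"), ("<i>", "</i>")]
doc = "<h1>Capital Cities</h1> <b>France</b> - <i>Paris</i> <b>Japan</b> - <i>Tokyo</i>"
print(lr_extract(doc, W))  # [('France', 'Paris'), ('Japan', 'Tokyo')]
```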
This way of defining the rules is apparently not usable for all tasks; let's consider the slightly modified example in figure 3.3. If the above LR wrapper were invoked on this document, the caption would be incorrectly identified as a city name. As a solution, more complex wrapper classes have been defined:

• HLRT (head-left-right-tail). Two additional delimiters are used; the head delimiter is used to skip potentially confusing text in the document heading and the tail delimiter skips potentially confusing text at the bottom of the document.

• OCLR (open-close-left-right). The open and close delimiters identify each record in the document.

• HOCLRT (head-open-close-left-right-tail). An analogous combination of the above two classes.

• N-LR and N-HLRT. Modifications of the LR and HLRT classes for handling nested tabular data in documents.
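As a minimal sketch in the spirit of the head delimiter (an illustration with assumed delimiter strings, not Kushmerick's formal definition), the misleading caption of figure 3.3 can be skipped by cutting the document at a head string before applying an LR rule:

```python
def hlrt_extract(html, head, left, right, tail):
    """Skip everything before `head` and after `tail`, then apply one LR rule."""
    body = html[html.find(head) + len(head):]
    tail_pos = body.find(tail)
    if tail_pos != -1:
        body = body[:tail_pos]
    values, pos = [], 0
    while True:
        start = body.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = body.find(right, start)
        if end == -1:
            break
        values.append(body[start:end])
        pos = end + len(right)
    return values

# The <i>Capital Cities</i> caption precedes the head delimiter and is skipped:
doc = "<i>Capital Cities</i> <b>France</b> - <i>Paris</i> <b>Japan</b> - <i>Tokyo</i>"
print(hlrt_extract(doc, "</i>", "<i>", "</i>", "<end-of-page>"))  # ['Paris', 'Tokyo']
```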
According to Kushmerick, the six wrapper classes were able to handle about 70% of real web documents. It is obvious that a wrapper works properly only for a limited set of documents that correspond to the previously defined extraction rules. In the literature, such a set of documents is commonly called a document class. In most cases, the document class consists of documents on the same topic generated automatically from a back-end database by an identical procedure, or at least created by the same author. Moreover, the wrapper only works until the data presentation changes. As follows from a simple comparison of the documents in figures 3.1 and 3.3, even a minor change in the document design can cause the wrapper to stop working properly. Due to the distributed and dynamic nature of the Web, such changes cannot be predicted, and since no additional information about the extracted data is provided, it is not trivial to detect the malfunction of the wrapper automatically.
When using wrappers for integrating information from many sources, one or more wrappers must be created for each source, and when some conditions change, the wrappers must be modified appropriately. From this point of view, the method by which the wrappers are constructed is important. The most obvious method is writing the wrappers by hand, i.e. by analyzing a set of documents to be processed and determining the delimiting strings. This method is very time-consuming and error-prone; unfortunately, it is currently the most used method. Companies employ people that work on coding new wrappers and maintaining the old ones. Since this approach presents a serious scalability problem, many approaches have been developed for the automatic inference of wrappers.
Most methods for automatic wrapper construction are based on wrapper induction. This approach is based on machine learning algorithms and the wrapper construction proceeds in the following phases:
1. A supervisor provides a set of training samples (i.e. labeled HTML pages)
2. A machine learning algorithm is used to learn the extraction rules
3. A wrapper is generated based on the extraction rules
4. The wrapper is used on the target documents

    Capital Cities                     html
                                      /    \
    Country   Capital             head      body
    France    Paris                 |      /    \
    Japan     Tokyo              title   h1      table
                                               /   |   \
                                             tr    tr    tr
                                            /  \  /  \  /  \
                                           td  td td  td td  td

Figure 3.4: An HTML code tree


For the inference of the extraction rules, a model of the document text and the embedded tags must be created. The character of this model depends on the machine learning algorithm used. In the following sections, we give an overview of the document models used and the existing methods for wrapper induction.
3.1.2 Document Code Modeling

The most straightforward model is to represent the document code simply as a string of characters. In this representation, the text of the document is not explicitly distinguished from the embedded tags. When processing documents represented this way, extraction rules based on delimiting substrings [57] or regular expressions [3] are usually used. For example, words that end with a colon can introduce an important data value. Such phrases can be found using the regular expression [A-Za-z0-9 ]+[:].
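This pattern can be tried directly; a small sketch (the sample text is our own):

```python
import re

text = "Geography: mountainous terrain. Transportation: rail and road."
labels = [m.strip() for m in re.findall(r"[A-Za-z0-9 ]+[:]", text)]
print(labels)  # ['Geography:', 'Transportation:']
```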
For machine learning algorithms, it appears more suitable to use a different representation where the basic unit is a word instead of a character. The document is represented as a sequence of words [15, 26, 35, 49]. Each word can be assigned various attributes based on some of its orthographical or lexical properties. The embedded HTML tags can be either omitted or used for inferring additional attributes of the individual words [26, 49] (e.g. the text is in a caption). A special case of such a model is used by [35]. In this model, on the contrary, the HTML tags are regarded as the symbols of an alphabet Σ; any text string between each pair of subsequent tags is represented by a reserved symbol x.

Most common is a hierarchical model of the HTML code that represents the nesting of the tags in the document [12, 17, 21, 22, 23, 37]. Figure 3.4 shows an example of a simple document and the corresponding tree of HTML tags. In order to make it possible to create such a model for an HTML document, it is necessary to pre-process the document so that we obtain a so-called well-formed document [12] where every opening tag has a corresponding closing tag and the tags are properly nested. XHTML documents are always well-formed. The text content of the document is then contained in the leaf nodes of the tree, or it is not included in the model at all. The advantage of the hierarchical model is that it describes the relations among the tags in addition to the observed properties of individual words and tags.
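Building such a tag tree from well-formed code can be sketched with Python's standard html.parser module (a minimal illustration, not an implementation from the cited works):

```python
from html.parser import HTMLParser

class TreeBuilder(HTMLParser):
    """Build a nested-list tree [tag, child, ...] from well-formed HTML."""
    def __init__(self):
        super().__init__()
        self.root = ["#root"]          # artificial root node
        self.stack = [self.root]       # path from the root to the open element

    def handle_starttag(self, tag, attrs):
        node = [tag]
        self.stack[-1].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        self.stack.pop()

    def handle_data(self, data):
        if data.strip():               # text content becomes a leaf node
            self.stack[-1].append(data.strip())

p = TreeBuilder()
p.feed("<html><body><h1>Capital Cities</h1></body></html>")
print(p.root)  # ['#root', ['html', ['body', ['h1', 'Capital Cities']]]]
```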

3.1.3 Wrapper Induction Approaches

Current methods of wrapper induction are based on knowledge from various areas of research. Many approaches are based on grammatical or automata inference [18, 25, 35, 37, 38], while other approaches use relational machine learning algorithms [17, 21, 26, 57]. A quite different approach is presented by the methods based on conceptual modeling [22, 23]. Note that this is only a coarse classification and the different approaches influence each other.
As mentioned in section 3.1.1, the wrapper induction approaches require a set of labeled examples of the documents that are used for inferring the extraction rules. According to artificial intelligence terminology, we call this set of examples a training set and the process of inferring the extraction rules wrapper training.
For evaluating the performance of the individual information extraction approaches, two metrics are commonly used: the precision P and the recall R [37]. They are defined as follows:

    P = c / i                                                    (3.1)

    R = c / n                                                    (3.2)

where c is the number of correctly extracted records, i is the total number of extracted records and n is the real number of records in the document.
Methods Based on Grammatical Inference

Grammatical inference is a well-known and long studied problem (the earliest studies date back to the 1960s), and its application to information extraction is therefore supported by a large theoretical foundation. The problem is the following: we have a finite alphabet Σ and a language L ⊆ Σ* (usually, regular or context-free languages are discussed in this context). Given a set S+ of sentences over Σ that belong to L and a (potentially empty) set S− of sentences over Σ that do not belong to L, we want to infer a grammar G that generates the language L.
The basic idea behind using grammatical inference for information extraction is that generating a wrapper for a set of HTML documents corresponds to the problem of inferring a grammar for the HTML code of the pages and finally using the inferred grammar for the extraction of the data fields. This idea is, however, not directly applicable to document processing. The main obstacle is that only positive examples are available on the web, i.e. the available documents. As follows from Gold's work [30], neither regular nor context-free grammars can be correctly identified from positive samples only. This problem can basically be solved by limiting the language class to a subclass of regular languages that is identifiable in the limit (e.g. k-reversible languages) or by changing the computational model (artificial negative samples or supplying additional information).
One of the approaches to information extraction is presented in [25]. For lo-
cating a particular data field in the document, a combination of a Bayes classifier and grammatical inference is used. The document is modeled as a sequence of words. The Bayes classifier processes parts of the document determined by a floating window of a fixed length of n words. To each position of the window, a probability is assigned that the particular part of the text matches the particular data field (e.g. the name of a person). The problem of this method is in determining the exact boundaries of the data field (the window has a fixed size). Moreover, the classification doesn't consider the word order; the Bayes classifier only works with the occurrence of individual words. These problems are solved by the grammatical inference. The words contained in the document are converted to abstract symbols from an alphabet Σ using their orthographical properties. Thus the alphabet Σ contains symbols of the type word-lower+dr (the abbreviation "Dr." written arbitrarily in upper-case or lower-case letters), capitalized-pct (any word beginning with a capital letter), etc. The training set is used both for training the Bayes classifier and for inferring a finite automaton, where each terminating state is assigned a probability that the accepted string corresponds to the particular data field. When the string is not accepted at all, it is assigned a small probability. The result is then a product of the probabilities from the Bayes classifier and from the automaton.
Another approach to the application of grammatical inference is presented by Kosala [37]. This method is based on tree languages. The document is modeled according to its HTML code as a tree where the nodes contain the name of the HTML tag or a text string. Instead of a set of strings over Σ, we obtain a set of trees over a new alphabet V where each tree corresponds to a particular document from the training set. In the sample trees, the data field to be extracted is replaced by a special symbol x. Then, the aim is to infer a deterministic tree automaton that accepts the trees in which the desired data value is replaced by x. When using this automaton for information extraction, we subsequently replace the nodes of the document tree by x. Once the resulting tree is accepted by the automaton, the extraction result is the original string that has been replaced by x.
A different way of applying grammatical inference is presented by [35]. This work is based on stochastic context-free grammars. The input alphabet is formed by the HTML tags and an extra symbol text that represents any non-empty text string between a pair of tags. During the grammatical inference process, the complexity of the grammar is evaluated and the simplest grammar is chosen. The non-terminals of the inferred grammar correspond to basic parts of the document. For more exact localization of the data fields, regular expressions are used that represent the domain-specific knowledge.
A completely new view of the problem is presented by [18]. The presented approach deals with the problem of schema discovery: given a set of HTML documents, we look for a common schema of their content and the extraction rules based on the discovered schema. The schema discovery is based on comparing the documents; the parts that are present in all the documents are considered static content, whereas the changing parts correspond to the data values.

Hidden Markov Models
A Hidden Markov Model (HMM) is a finite state automaton with stochastic state transitions and symbol emissions. The automaton models a probabilistic generative process whereby a sequence of symbols is produced by starting at a designated start state, transitioning to a new state, emitting a new symbol, and so on until a designated final state is reached.
The application of HMMs to information extraction is based on the hypothetical assumption that the text of a document has been produced by a stochastic process, and we attempt to find a Markov model of this process. The states of the model are associated with the tokens to be extracted. The model transition and emission probabilities are learned from training data. The information extraction is performed by determining the sequence of states that was most likely to have generated the entire document and extracting the symbols that were associated with designated target states.
This approach is used for example by [27]. For each field to be extracted, a separate HMM is used that consists of two types of states: the target states that produce the tokens to be extracted, and background states. Each of the HMMs models the entire document, so that no pre-processing is needed, and the entire text of the documents from the training set is used to train the transition and emission probabilities.
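Determining the most likely state sequence is typically done with the Viterbi algorithm; the following is a minimal two-state (background/target) sketch where all probabilities are hand-set illustrative assumptions, not values learned from training data:

```python
def viterbi(tokens, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observed tokens."""
    V = [{s: start_p[s] * emit_p[s].get(tokens[0], 1e-6) for s in states}]
    path = {s: [s] for s in states}
    for tok in tokens[1:]:
        V.append({})
        new_path = {}
        for s in states:
            # best predecessor state for s at this position
            prob, prev = max(
                (V[-2][p] * trans_p[p][s] * emit_p[s].get(tok, 1e-6), p)
                for p in states)
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(states, key=lambda s: V[-1][s])
    return path[best]

states = ["bg", "target"]
start_p = {"bg": 0.9, "target": 0.1}
trans_p = {"bg": {"bg": 0.8, "target": 0.2}, "target": {"bg": 0.6, "target": 0.4}}
emit_p = {"bg": {"name": 0.5, ":": 0.5}, "target": {"Paris": 0.9}}
print(viterbi(["name", ":", "Paris"], states, start_p, trans_p, emit_p))
# ['bg', 'bg', 'target'] -- the token "Paris" is attributed to the target state
```

The small constant 1e-6 stands in for the unseen-token smoothing that a real implementation would need.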
Relational Learning Approaches
The general principle of these techniques is similar to the above. Again, we assume that there exists a class of documents with similar properties and that a training set of documents from this class is available. In the training set, we describe some properties of each data field to be extracted using logical predicates. Then, we use relational learning algorithms for inducing general rules that identify the data fields in documents.
Freitag [26] assigns each word in the documents certain attributes based on the properties of the given portion of the text, such as word length, character type (letters, digits) or orthography, and adds some additional attributes that describe the relation between the word and the surrounding HTML tags (e.g. the word forms part of a heading or the word forms a table row). Each data field to be extracted is then described by logical predicates based on these attributes, and using the SRV algorithm (based on the FOIL algorithm [53]) a general rule is inferred that identifies the data field in the document. DiPasquo [21] extends this approach by modeling the hierarchical structure of HTML tags in the document, which allows describing the relations among the HTML tags more exactly.
Soderland [57] uses an approach based on methods for natural language processing. Each word of the document is assigned a semantic class (e.g. Time, Day, Weather condition) using predefined concept definitions. Using the learning algorithms AQ and CN2 [16], a general description of each data field is inferred.

3.1.4 Alternative Wrapper Construction Approaches

Wrapper induction is not the only method for automatic wrapper construction. The following techniques are based on direct analysis of the HTML code of the particular documents to be processed. The aim of these techniques is to avoid the training phase and eliminate the requirement of a training set of documents. On the other hand, these techniques are usually based on empirical heuristics and it is often hard to specify exactly for which documents the method is suitable. Furthermore, some predefined domain-specific knowledge is often required and in some cases (e.g. [33]) the method is language dependent.
HTML-related Heuristics
These techniques are based on specific empirical heuristics related to the HTML language in general or to some generally accepted ways of its usage.
Ashish [3] locates certain important words that introduce important information in the document (e.g. Geography, Transportation, etc.), so-called tokens. The tokens are identified based on the properties of the text and the surrounding HTML tags, and all the possible occurrences are firmly defined by regular expressions: for example, the text between the <b> and </b> tags, words in headings, text that ends with a colon, etc. Each token indicates the start of a section of the document. Next, the hierarchical structure of sections is built by comparing the font size and the indentation of the text that begins each section. The proposed extraction tool contains a graphical user interface for interactive adjustment of the tokens and the hierarchical structure. Finally, the wrapper is generated using the YACC generator.
Another approach is used by [12, 22]. This approach assumes that a unified separator of the data records can be found in the document. The document is modeled as a tree of tags, and based on various heuristics, a general structure is discovered that is used as a record separator. As the next step, a data field separator is located in a similar way. The heuristics are based on the statistical analysis of the text in potential sections, repeating patterns, etc. Furthermore, predefined knowledge about the meaning of some HTML tags is used. A similar approach is used in [43]. The proposed MDR algorithm attempts to locate the regions of the document tag tree that potentially contain data records. In these regions, one or more data records can then be identified.
Conceptual Modeling
The conceptual modeling approach is more common in the area of information extraction from plain text documents; however, it can be used for HTML documents too. For example, Embley et al. [22, 23] propose a method where, as the first step, an ontological model of the extracted information is created, and based on this model, corresponding data records are discovered in the document. It is possible to combine this approach with the HTML code analysis described above. The main difference is that the structure of the information is not inferred from the document but is known in advance.

3.1.5 Computer-aided Manual Wrapper Construction

This category is formed by special tools that generate wrappers in collaboration with a human expert. These tools usually provide a graphical user interface that allows the wrapper creator to analyze the documents to be processed and to easily design a wrapper.
The DoNoSe tool [1] works mainly with plain text documents. The tool allows hierarchical decomposition of the contained data and mapping selected regions of the text to components of the data model. LiXto [5] is a fully visual interactive system for the generation of wrappers based on a declarative language for the definition of HTML/XML wrappers. Both tools provide a graphical user interface that allows users with no programming experience to produce the appropriate wrappers.
3.2 Cascading Style Sheets from the IE Perspective
With new technologies being introduced to the WWW, some critical disadvantages of the wrapper approach appear. For example, following the recommendations of the WWW Consortium, the usage of Cascading Style Sheets (CSS) [7] has grown significantly in the last few years. This technology allows defining the visual layout and formatting of an HTML or XML document independently of its content. This property is particularly useful when HTML documents are generated dynamically (e.g. from a database), since it allows modifying the visual appearance of the pages without modifying the HTML generator. On the other hand, it significantly reduces the amount of information that can be used by a wrapper for identifying the information in HTML documents.
Figure 3.5 shows an example of traditional HTML document formatting. It is obvious that all the names of the countries are denoted by the <b> and </b> tags. A wrapper for extracting countries from the document simply looks for these tags and extracts the enclosed text.
    ...
    <h1>Capitals</h1>
    <b>France</b>
    - <i>Paris</i>
    <b>Japan</b>
    - <i>Tokyo</i>
    ...

Figure 3.5: Document formatting using HTML tags

One possible variant of the same document code written using CSS is shown in figure 3.6. The above method for defining the wrapper fails in this case, because all the elements are denoted by the same <span> tag. Moreover, there are several ways of incorporating CSS into the HTML code (in our example, we can see classes defined by the class attribute and inline styles defined by the style attribute of the HTML tags).
    ...
    <span class="heading">Capitals</span>
    <span class="country">France</span>
    - <span style="font-style: italic;">Paris</span>
    <span class="country">Japan</span>
    - <span style="font-style: italic;">Tokyo</span>
    ...

Figure 3.6: Document formatting using CSS
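The effect on a tag-based wrapper can be demonstrated directly (a small sketch using the code of figures 3.5 and 3.6):

```python
import re

def bold_wrapper(html):
    """Extract country names, assuming they are enclosed in <b>...</b>."""
    return re.findall(r"<b>(.*?)</b>", html)

html_doc = '<h1>Capitals</h1> <b>France</b> - <i>Paris</i> <b>Japan</b> - <i>Tokyo</i>'
css_doc = ('<span class="heading">Capitals</span> '
           '<span class="country">France</span> '
           '- <span style="font-style: italic;">Paris</span>')

print(bold_wrapper(html_doc))  # ['France', 'Japan']
print(bold_wrapper(css_doc))   # [] -- the <b> tags are gone, the wrapper finds nothing
```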

As we can see, all the HTML tags have been replaced by a single <span> tag that is used for specifying the CSS class of the individual parts of the text. Moreover, as mentioned in section 2.2, the style definitions can be incorporated into the HTML code in different ways. In any case, the result is that the "semantic" HTML tags such as headings, emphasis, etc. may be completely removed from the HTML code and replaced by CSS definitions. This change significantly complicates, or even renders unusable, most of the wrapper induction methods mentioned above.
From this point of view, it is not reliable to construct wrappers that rely directly on the HTML code. HTML and CSS are only the means for creating documents and they can be used in various ways. The fixed point in this variable world is the final presentation of the document, which has usually been carefully designed by experts from various branches and which must be delivered to the reader in unchanged form, regarding especially the visual appearance and the structure of the presented document.
For this reason, instead of modeling the HTML code directly, more sophisticated models need to be proposed that describe the documents from the perspective of their final presentation. These models should describe the organization and the visual appearance of the documents as they are expected to be perceived by a human reader.
3.3 Advanced Document Modeling
3.3.1 Logical versus Physical Documents

The World Wide Web consists mainly of HTML documents that may reference each other using hypertext links. In this thesis, we will call these documents physical documents because each of them corresponds to a physical file stored on a WWW server. However, the hypertext links allow splitting complex information into multiple physical documents, where each of them contains a specific part of the information and these individual parts are interconnected by the hypertext links. This way of presentation is very frequent on the World Wide Web. We will call such a set of physical documents that forms a complete information entity a logical document. An example of a logical document is given in figure 3.7. The arrows represent links among the physical

Figure 3.7: Logical document

documents. The three physical documents in the dashed box form a logical document. As we can see, any of the documents that form the logical document can contain links to other external documents that do not belong to the logical document. Since it is not specified in the documents which of the links are external, the discovery of logical documents on the web is not a trivial task and requires further analysis of the documents and the links.
3.3.2 Logical Document Discovery

The task of logical document discovery consists of locating all the physical HTML documents that form a logical entity called a logical document. The input is the URI of the main page (sometimes also called the top page or the index page), which is intended by the author of the document to be the entry point to the logical document and is usually directly accessed by the users¹. The output is the list of the URIs of all the HTML documents that form the logical document.
The primary source of information for discovering the related physical documents are the HTML links. An HTML document that forms part of a logical document, except the main page, must be referenced by at least one other HTML document in the same logical document. The process of logical document discovery is therefore quite straightforward:

• We find the URIs of all the documents referenced in the main page
• We select all URIs that point to documents that belong to the logical document
• We repeat the process recursively for each selected document
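The three steps above can be sketched as a breadth-first traversal; the link structure and the same-host membership test below are placeholder assumptions, since deciding which documents belong to the logical document is exactly the open problem:

```python
from urllib.parse import urljoin, urlparse

def discover_logical_document(main_uri, get_links, belongs):
    """Breadth-first traversal from the main page, keeping member documents."""
    found, queue = {main_uri}, [main_uri]
    while queue:
        uri = queue.pop(0)
        for link in get_links(uri):
            target = urljoin(uri, link)       # resolve relative URIs
            if target not in found and belongs(main_uri, target):
                found.add(target)
                queue.append(target)
    return found

# Hypothetical link structure and a naive same-host membership heuristic:
links = {"http://a.org/index.html": ["part1.html", "http://b.org/x.html"],
         "http://a.org/part1.html": ["part2.html", "index.html"],
         "http://a.org/part2.html": []}
same_host = lambda main, uri: urlparse(main).netloc == urlparse(uri).netloc
print(sorted(discover_logical_document("http://a.org/index.html",
                                       lambda u: links.get(u, []), same_host)))
# ['http://a.org/index.html', 'http://a.org/part1.html', 'http://a.org/part2.html']
```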
The major problem that has to be solved is in the second step: how to distinguish
the documents that belong to the logical document from the remaining ones? There
exist several different types of information that can be used for resolving this task:

¹ The URI of the main page is usually publicly available, in contrast to the URIs of the remaining
documents.

- Document classification. We assume that the individual physical documents that
form the logical document are more similar to each other than the remaining
referenced documents. The similarity of documents is usually computed using
methods based on the term frequency in the document, such as the tf·idf
method [55].
- Document layout analysis. We analyze and compare the layout of the documents,
as mentioned for example in [42].
- Link topology analysis. In general, the topology of the links among a set of
HTML documents can be represented as a directed graph. By analyzing this
graph using specific heuristics we can detect a subgraph with certain properties.
This technique is also used for detecting so-called communities in the web [29].
Additionally, some limitations can be put on the format of the URIs (e.g. all
documents must be placed on the same web server, etc.).
Usually, these types of information are used together. Tajima et al. [60] show that
most logical documents are organized into hierarchies. Their approach is based on
the assumption that the authors include some hypertext links to the documents that are
intended to be used by the users as a standard way of getting to a particular document.
These links form so-called standard navigation paths. There are, of course, other links
that either point outside of the logical document or have a lower importance, such as
links back to the main page. The proposed method of logical document discovery has
two steps. First, the hierarchical structure is discovered by identifying paths intended
by the authors of the documents to be the standard navigation routes from the main
page to other pages. Then, the discovered hierarchy is divided into sub-hierarchies
corresponding to the logical documents based on comparing the document similarity
using the tf·idf method.
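The tf·idf similarity test used by these approaches can be sketched as follows. This is a minimal bag-of-words version with cosine similarity; stemming, stop-word removal and the other refinements of [55] are omitted:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse tf-idf vector (a dict)
    per document, weighting each term by tf * log(N / df)."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)               # term frequency within the document
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Two physical documents of the same logical document are then expected to score higher against each other than against unrelated referenced pages.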

3.3.3 Logical Structure of Documents

The information extraction methods described in section 3.1 have been based on a
direct analysis of the code of HTML documents. We can say that these methods work
with the physical realization of the documents. The bottleneck of this approach is the
too tight binding of the wrapper to the HTML code. The nature of HTML allows achieving
the desired document design in various ways that can be arbitrarily combined, which
makes the wrappers limited to a narrow set of documents and a short time period.
As an answer to these drawbacks, there have been several attempts to describe the
documents from a logical point of view.
The logical structure of a document [2, 13, 58] is basically a model of a document
that describes the relations among the logical sections of the document, such as
sections, paragraphs, figures, captions, etc. There are generally two approaches to
logical document structure discovery:
- Visual document analysis: we analyze the visual aspect of the documents. This
approach is applicable to any electronic document format such as PostScript,
PDF, HTML, etc. For HTML documents, many approaches to the visual analysis
have been published [14, 32, 36, 48, 62].
- Direct analysis of the hierarchy of tags in the document code: we assume that
the nesting of the tags corresponds to the logical structure of the document. This
approach is applicable to the markup languages only. It is more reliable from
the complexity point of view, but it is not usable in all cases (for example when
Cascading Style Sheets are used in a certain way) since the hierarchy of tags
need not necessarily correspond to the logical structure.
In this thesis, by the notion of logical document structure we understand the
logical structure of any document, either the physical or the logical one depending on
context. The notion of logical document structure has been introduced by Summers
[58, 59] in the context of processing PDF, PostScript and scanned-in documents,
and it is defined as a hierarchy of segments of the document, each of which corresponds
to a visually distinguished semantic component of the document. Other authors use
the notions of document structure tree [36] or document map [63] in a similar sense.
For some time, it has been assumed that in the case of HTML documents there is no
need for modeling the logical structure because it is directly present in the document
in the form of the HTML tags. However, a closer analysis shows that there is only a
very loose binding between the HTML tag tree and the logical structure of an HTML
document. The reason is that HTML provides both structural and presentational
capabilities that can be arbitrarily combined. Furthermore, the effort of the document
authors aims at the resulting visual presentation rather than logically correct HTML
code, so many tags are often misused. Thus, creating the logical structure of an HTML
document is not trivial either, and it requires a more detailed analysis of the document.
In almost all works published on logical document structure analysis, the resulting
model is a tree, where the nodes correspond to individual logical parts of the
document. This model is based on the observation that each HTML document consists
of elements that specify information-carrying objects at different levels of abstraction
through object nesting. The logical structure can be viewed as objects of a
higher level of abstraction being described by objects of finer levels of abstraction [14].
Such a hierarchical conception of document organization seems to be natural to the
document authors as well as the readers. This fact has been observed by many authors,
for example [1, 4, 14, 20, 49, 58, 60], without any more detailed reasoning. We believe
that the main reasons for the hierarchical document organization are:
- It is efficient. Hierarchical organization of a document allows better orientation
in the text. It is possible to find the section that deals with a particular topic
without having to read the whole document.
- It is feasible. A standard document is linear: it has a beginning and an end. The
hierarchical organization can be easily reached by using various levels of headings
and labels. The organization is then apparent to the reader, especially when a
table of contents is included. It is not feasible to achieve some more complex
organization, such as a general graph, without confusing the reader. The situation
is different in the case of logical hypertext documents; this problem is discussed
separately in section 5.7.
- It is natural. The hierarchical organization of a structured text has been widely
used in technical and popular articles and books. People are used to it. This is
actually a consequence of the above two reasons.
For the above reasons, the authors of structured documents mostly prefer the hierarchical
organization and the readers automatically expect it. Therefore, a tree appears
to be sufficient for modeling all kinds of documents.
3.3.4 Visual Analysis of HTML Documents

There are several approaches to creating the model of the logical structure of HTML
documents. They differ in the granularity of the resulting model, which depends on its
intended application.
The work of Carchiolo et al. [13] deals with the discovery of the logical schema of
a web site that contains multiple documents. For this purpose, basic logical sections
are localized in each document, such as the logical heading, logical footer and logical data,
where the semantics of the page is mainly placed. The proposed approach is based
on a bottom-up HTML code analysis. First, collections of similar code patterns are
localized in the document. As the second step, each section is assigned a meaning
(e.g. logical header) based on the semantics of the HTML tags (e.g. the <form> tag
denotes an interactive section) or on some information retrieval techniques (e.g. the
header section is the collection that refers to the text in the title of the document
or to the URI of the page). Similarly, [62] discovers semantic structures in HTML
documents. This approach is based on the observation that in most web pages, the layout
styles of subtitles or data records of the same category are consistent and there are
apparent boundaries between different categories. First, the visual similarity of the
HTML content objects is measured. Then, a pattern detection algorithm is used to
detect frequent patterns of visual similarity. Finally, a hierarchical representation of
the document is built. The method described in [48] is based on a similar principle. A
key observation of this method is that semantically related items in HTML documents
exhibit spatial locality. Again, a tree of HTML tags is built and similar patterns of
HTML tags are discovered. Finally, a tree of the discovered structures is built.
While the above methods discover the logical structure of the document to the level
of basic semantic blocks, the work of Chung et al. [14] is more oriented to information
extraction. Based on the visual analysis, it attempts to locate data fields in the documents
and store them in an XML representation. It is assumed that the documents being
processed pertain to a particular, relatively narrow ontology. Furthermore, certain
domain knowledge provided by the user is necessary in the form of topic concepts and
optional concept constraints. Each concept is described by a set of concept instances
that specify the text patterns and keywords as they might occur in topic-specific HTML
documents. By contrast, the topic constraints describe how concepts as information-carrying
objects can be structured. As the first step, a majority schema of the document
is inferred in the form of a document type definition (DTD). Next, the data fields
corresponding to the individual concepts are discovered.
The last described approach is quite different. In [32] a method for web content
structure discovery is presented that is based on modeling the resulting page layout
and locating basic objects in the page through projections and two basic operations,
block dividing and merging. The projection allows detecting visual separators that
divide the page into smaller blocks and, again, adjacent blocks can be merged if they
are visually similar. This dividing and merging process continues recursively until the
layout structure of the whole page is constructed.

Chapter 4

Motivation and Goals of the Thesis
Currently, wrappers are mainly used for obtaining the data from various web sites
in order to create a service that provides a centralized view of the data from a certain
domain available in the WWW. Such a service allows a user to effectively use the data
available in the web with no need of locating the appropriate web sites, browsing the
documents and locating the appropriate data in the documents. Such a service is most
frequently offered for the shopping domain: several services for comparing prices of
goods in on-line shops are available (e.g. AmongPrices¹ or Compare Online²). Another
good example is the financial domain: services for comparing stock quotes, exchange
rates, etc.
Although many techniques for automatic wrapper construction have been proposed,
the most used approach is still manual or semi-automatic wrapper construction.
The causes of this situation are the following major problems that appear to a different
extent in the above mentioned wrapper induction approaches:
- The necessity of wrapper training. A sufficiently large set of annotated training
data is required. The training set preparation is time-consuming; moreover, in
some cases no training data is available at all (e.g. when there is only one instance
of the page available). A partial solution of this problem is bootstrapping [52].
- Brittleness. When some change occurs in the documents being processed, the
wrapper can stop working properly. The detection of this situation is not a trivial
problem [40]. Moreover, new training data must be prepared and the wrapper
has to be re-induced.
- Low precision. For a practical setting of the wrapper, it is necessary that the
produced data is reliable, i.e. the values of precision and recall are close to
100%. The precision of current methods varies between 30 and 80% [15, 21, 26, 37]
depending on the used method and the processed documents.

¹ http://www.amongprices.com/
² http://www.compare-online.co.uk

There are several causes of these problems. The most important one is the too tight
binding of the produced wrapper to the HTML code, which makes the wrapper very
sensitive to any minor irregularity in the HTML code. All the above mentioned wrapper
induction methods are based on an assumption that there exists some direct correspondence
between the HTML code and its informational content. However, the relevance
of this assumption is quite questionable. As stated in the introduction, HTML
allows defining the visual appearance of a document, i.e. the presentation and formatting
of the contained text. The relation between this formatting and the semantics
of the document content is not defined anywhere. Wrappers based on assumptions
regarding this relation, derived from a previous analysis of some documents,
are therefore based on information whose relevance is not guaranteed
anywhere, and it can be (and usually is) limited to a certain set of documents and
a certain (unpredictable) time period. Moreover, the documents can contain various
irregularities and special features, so that the rules induced for one document need not
be completely applicable to another document even if it belongs to the same class of
documents.
As a solution to the above mentioned disadvantages of the wrapper approach, we
propose developing a new method that fulfills the following requirements:
1. The documents are analyzed at the time of extraction. The method shouldn't be
based on any features of the document or the set of documents that have been
discovered in the past. Using information that was relevant at some time in the
past can lead to incorrect results because the document may have been changed
in the meantime.
2. Only the features of the logical document being processed are used for information
extraction. Using some knowledge about the features of other documents that are
considered to be "similar" to the currently processed one can lead to incorrect
results and imprecision.
3. The method must be independent of the physical realization of the document. It
must be based on the final document appearance, which must respect certain rules,
rather than on the underlying HTML/CSS technology, which can be chosen arbitrarily.
This includes the situations when the presented information is split into several
physical documents. Always the whole logical document should be analyzed.
These requirements solve the problem of the necessity of a training set of documents
by analyzing each logical document individually. Moreover, this feature should
improve the precision of extraction because the method is not based on extraction
rules inferred for other documents. And lastly, the independence of the physical
realization of the logical document should significantly reduce the brittleness of inferred
wrappers. The goals of our work can be summarized in the following points:
1. To find a suitable model that describes the documents on a sufficient level of
abstraction and to define this model in a formal way.
2. To propose a new method of information extraction based on the defined model
that fulfills the above requirements and resolves the above mentioned problems.
3. To evaluate the proposed method experimentally on real WWW documents.
In the following sections, we present the results of our work that have been achieved
while fulfilling the above specified goals.

Chapter 5

Visual Information Modeling Approach to Information Extraction
As discussed in the previous chapter, most problems of the wrapper approach are caused
by the fact that the wrappers interpret the HTML code too literally. This makes the
wrapper very sensitive to any irregularity or a slight change of the document code. For
removing this strict dependence on the HTML code, it is necessary to create a more
general model that describes the document, abstracting from the nonessential
implementation details.
The HTML tags embedded in the text of the document are an instrument for
achieving a certain visual presentation of the document contents. Since the documents
in the World Wide Web are designed to be browsed by human users, it seems
reasonable to create a model describing the resulting rendered document as it is
expected to be displayed in the browser window. When designing a document, the
authors use the HTML tags for achieving a particular design with the following particular
aims:
- To make the document good-looking from the aesthetic and typographical point of view.
- To encourage the user to read the document; i.e. to make the document visually
"attractive".
- To make the document well arranged; i.e. to allow the user an effective orientation
in the document with no need of reading each individual part of the document.
Although from the aesthetic and marketing point of view the first two points are
important, from the point of view of automatic document processing the last point
is the only important one. We are not interested in how attractive the document is for
the user, but in what information the user is given in order to be able
to quickly understand the organization of the document. There exist several commonly
accepted rules in the visual presentation of documents that are used for giving
the reader this information. These rules have been established during the evolution
of typography, long before the first electronic documents appeared. The most common of
them are the following:
- The parts of the text that deal with different topics or that have a different purpose
are visually separated; e.g. the individual articles in a newspaper, footnotes
in a book, etc.
- Bold text, italics and underlining are used to stress the significance of a particular
part of the text. In modern typography, different colors of the text are sometimes
used for highlighting a particular word or a sentence.
- The larger the font used for typing a particular text, the more important this text
is. E.g. the most important affairs in newspapers are announced with banner
headlines whereas the less important notes are written with a small font size.
- The headings and labels that denote certain information are highlighted using
some combination of the above means.
In the case of World Wide Web documents, these visual instruments are used even more
intensively than in the traditional media. Since the Internet is usually used for quickly
and effectively obtaining some information, the documents must provide a great amount
of visual cues that navigate the reader. Most often, these cues have the form of highlighted
headings and labels that denote the meaning of each part of the document.
In this chapter, we propose an information extraction method that is based on
modeling the visual information in the document that is intended to be used by the
readers for quick navigation. First, we define the method for processing a single HTML
document. Later, in section 5.7, we extend the method to logical documents formed
by multiple HTML documents. As stated in the introduction, in this thesis we focus
on the data identification phase of the information extraction process.
5.1 Proposed Approach Overview
The principle of the proposed approach is shown in figure 5.1. In contrast to the
traditional wrapper approach (figure 3.2), the information extraction process consists
of multiple steps.

Figure 5.1: Visual modeling of an HTML document (HTML code analysis produces the
page layout model and the text features model; model transformation yields the logical
document structure; subtree matching against a structured query produces the extracted data)

As a first step, we analyze the HTML document that contains the data to be
extracted and we create a model of the visual information as it is expected to be
presented in a standard web browser. This model consists of two separate components:
the model of the page layout and the model of the visual features of the text.
As the next step, we transform these two components to a unified model of the
document logical structure. This model describes the document content on a significantly
more abstract level. As defined in section 3.3.3, the logical structure only describes
hierarchical relations among individual parts of the document content. When we represent
the text content of the document as a text string (omitting the embedded HTML tags),
the resulting model of the logical structure is a tree where each node contains a substring
of the text of the document and the edges represent relations between a superior
or more general part of the text (e.g. a heading) and the inferior, more concrete part
(e.g. the chapter contents or a data value). A more detailed specification can be found
in section 5.5.
After creating the logical structure model, the next step is the information extraction
itself. Since the logical structure model is a tree of text strings, the information
extraction task can be formulated as a problem of locating a particular tree node that
contains the desired information. In our method, we propose defining the information
extraction task as a template of a subtree of the logical document structure. After this,
the information extraction task consists of locating all the subtrees of the logical structure
model that correspond to the specified template. Each subtree found corresponds
to a data record in the extracted data.
In the following sections we give a detailed description of the individual steps.
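The final subtree-matching step can be sketched as follows. The thesis defines templates over the logical structure tree; the simple (label, value) regular-expression template and the class names used here are an illustrative simplification, not the thesis's final formalism:

```python
import re

class Node:
    """One node of the logical structure tree: a substring of the
    document text plus its subordinate (more concrete) parts."""
    def __init__(self, text, children=()):
        self.text = text
        self.children = list(children)

def match_template(node, template):
    """Yield a data record for every subtree matching `template`.

    template: (label_regex, value_regex) -- matches a node whose text
    fits label_regex and has a child whose text fits value_regex.
    """
    label_re, value_re = template
    if re.search(label_re, node.text):
        for child in node.children:
            m = re.search(value_re, child.text)
            if m:
                yield (node.text, m.group(0))
    for child in node.children:          # search the rest of the tree
        yield from match_template(child, template)
```

Each yielded pair plays the role of one data record extracted from the logical structure.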
5.2 Visual Information Modeling
As discussed in previous sections, the documents in the World Wide Web have a more
or less hierarchical organization so that the user can effectively locate the desired
information in the document. This hierarchical organization is expressed by two basic
means:
1. By splitting the document into several visually separated parts that can be arbitrarily
nested; i.e. the page layout.
2. By providing a hierarchy of headings and labels of different levels of abstraction
that describe the contents of a part of the document or the meaning of particular
data presented in the document. This hierarchy is expressed by various
typographical attributes of the text; e.g. the more important the heading is, the
larger the font size used, etc.
In order to obtain the logical structure of a document, we have to create models of
both components of the visual information, the page layout and the typographical
attributes, and to transform these models to the model of the logical document structure,
as shown in Figure 5.2.

Figure 5.2: Visual modeling of an HTML document (the page layout model and the text
features model are derived from the HTML document and combined into the logical
document structure)
5.2.1 Modeling the Page Layout

In order to clearly distinguish different kinds of information in the document, web
pages are usually split into multiple areas. Figure 5.3 shows a typical example of a
document split into several visual areas.

Figure 5.3: Visual areas in a document

We can notice three basic visual areas: the
header on the top of the page, the left column with a navigation menu and the main
part that carries the informational contents of the page. The main part is further split
by a horizontal separator into two parts, the main contents and the footer. We can see
that some areas are nested and thus there is a hierarchy of visual areas present in the
document.
Generally, the visual areas in a document can be visually expressed using various
visual separators (horizontal rules, boxes, etc.) or by different visual properties, the
most usual one being a different background color. In HTML, it is not difficult to enumerate
all the possible means that can be used for creating a visually separated area in a
document. As follows from the HTML specification [54], the set of available means
is quite limited. To be specific, such an area can be created using the following HTML
constructions:
- Document. The whole document can be considered as a visual area that always
forms the root node of the visual area hierarchy.
- Tables and table cells. Each cell of a table forms an area that can have its own
visual attributes. The table forms an area that holds all the cells and optionally
a table caption.
- Lists and list items. A list item itself forms a visually separated area; a list can
contain multiple items.
- Paragraphs. Paragraphs are usually not used for creating separated areas, but
they can be interpreted that way when used together with CSS and the visual
attributes of a paragraph are sufficiently different from the attributes of the preceding
and following text.
- Generic areas. HTML provides a generic area <div> that has no influence on the
visual attributes itself. However, it is often used together with CSS, which allows
specifying the style and position of the area.
- Frames. HTML frames allow combining multiple HTML documents in a page.
Each of the documents forms an area.
- Horizontal rule. The <hr> element cannot be used for creating a standalone visual
area, but it allows inserting a horizontal line that splits an area into two parts.
A complete list of the related page objects with the appropriate HTML tags is given
in Table 5.1.

Page object        HTML tags
Document           <html>
Table, table cell  <table> <th> <td>
List, list item    <ul> <ol> <dl> <li> <dt> <dd>
Paragraph          <p>
Generic area       <div>
Frames             <frameset> <frame>
Horizontal rule    <hr>

Table 5.1: Visual areas in HTML and corresponding tags
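Table 5.1 can be encoded directly as a predicate deciding whether an HTML tag opens a new visual area. A sketch; <hr> is kept in a separate set because, as noted above, it only splits an existing area rather than creating one:

```python
# Tags from Table 5.1 that create a (possibly nested) visual area.
AREA_TAGS = {
    "html",                              # document
    "table", "th", "td",                 # table, table cell
    "ul", "ol", "dl", "li", "dt", "dd",  # list, list item
    "p",                                 # paragraph
    "div",                               # generic area
    "frameset", "frame",                 # frames
}
SPLIT_TAGS = {"hr"}                      # splits an area into two parts

def creates_visual_area(tag):
    """True if the given HTML tag forms a visual area of its own."""
    return tag.lower() in AREA_TAGS
```

A parser walking the tag tree would open a new area node on every tag for which this predicate holds.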

For the purpose of modeling the page layout, we define the notion of visual area as
any area in the document that is formed by one of the mentioned means, independently
of its visual attributes. For example, we consider each table cell a separate visual area
independently of whether the surrounding table cells have a different background color or
whether they are separated by a bounding box. The visual areas in a document form
a hierarchy where the root represents the whole document and the remaining nodes
represent the visual areas in the document, which can possibly be nested. For modeling
the page layout, we assign each visual area a unique numeric identifier v_i ∈ I, where
the whole document has v_0 = 0 and the remaining areas have v_i = v_{i-1} + 1. Then, the
layout of the page can be modeled as a tree of area identifiers v_i, as shown in figure 5.4.

Figure 5.4: Example of the page layout model

Formally, the model of the page layout can be denoted as a graph:

    M_l = (V_l, E_l)                                                    (5.1)

where V_l = {0, 1, ..., n-1} is the set of all area identifiers that form the nodes of the
tree, and (v_i, v_j) ∈ E_l iff v_i and v_j are visual area identifiers and the area identified
by v_j is nested in the area identified by v_i.
This model doesn't contain any information about the visual attributes of the areas;
it only represents the way in which the visual areas are nested. However, by assigning
individual parts of the document text to the visual areas we can obtain the information
about what parts of the text are related. The model thus represents the most important
information the page layout gives to the reader, as discussed above.
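The model M_l = (V_l, E_l) of equation (5.1) can be built by a pre-order walk that numbers the areas as they are encountered. This sketch works over an abstract nested-list description of the areas rather than real HTML; that input encoding is an assumption made only for illustration:

```python
def build_layout_model(root_children):
    """Build M_l = (V_l, E_l) from a nested-list description of the areas.

    root_children: nested lists, e.g. [[[], []], []] means the document
    contains two top-level areas, the first with two sub-areas.
    Returns (V_l, E_l): the identifier set and the nesting edges.
    """
    vertices, edges = [0], []          # v_0 = 0 is the whole document
    counter = [0]

    def walk(parent_id, children):
        for child in children:
            counter[0] += 1            # v_i = v_{i-1} + 1
            vid = counter[0]
            vertices.append(vid)
            edges.append((parent_id, vid))   # child is nested in parent
            walk(vid, child)

    walk(0, root_children)
    return set(vertices), edges
```

The edge direction (parent, child) encodes "nested in" exactly as required by (5.1): (v_i, v_j) ∈ E_l means the area v_j lies inside v_i.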

Figure 5.5: Visual attributes of the text

5.2.2 Representing the Text Features

Various visual attributes of certain portions of the text are often used for emphasizing
the importance of the particular text portion or, on the contrary, to suppress its importance.
Another common use of the visual attributes is to distinguish the headers
and labels in a document from the remaining text. An illustration of this fact is given
in figure 5.5. In the text of the first section, the words "weight" and "markedness"
are emphasized by using italics. Similarly, the word "weight" in the second section is
emphasized by using a bold font. There are two levels of headings in the document
that are distinguished by the font size. The first-level heading gives the title of the
document, whereas the two second-level headings are used for the titles of the sections.
Furthermore, there is a label "Number of cars sold" that introduces the value of "125".
The function of the label is indicated by the trailing colon.
For the purpose of automatic document processing, we need to create a model of the
documents describing the mentioned features of the text. This model has to describe
these features on an abstract level. For example, we want to describe the fact that
the word "weight" is emphasized in both occurrences. The means used, i.e. whether the
emphasis is achieved using a bold font or italics, is not significant from our point
of view. The same effect can be achieved for example by typing this word in red color.
Similarly, the font size used for the word "Introduction" is not important; we want to
describe that this heading has a lower level than the main document heading but it is
at a higher level than the remaining text.
An HTML document consists of the text content and the embedded HTML tags.
Some of the tags can specify the typographical attributes of the text. Let's denote
T_html a set of all possible HTML tags and S an infinite set of all possible text strings
between each pair of subsequent tags in a document. Then, an HTML document D
can be represented as a string of the form

    D = T_1 s_1 T_2 s_2 T_3 s_3 ... T_n s_n T_{n+1}                     (5.2)

where s_i ∈ S is a text string with the length |s_i| > 0 that doesn't contain any embedded
HTML tags and n, n ≥ 0 is the number of such strings in the document. T_j,
1 ≤ j ≤ n + 1 is a string of HTML tags of the form

    T_j = t_{j,1} t_{j,2} t_{j,3} ... t_{j,m_j}                         (5.3)

where t_{j,k} ∈ T_html and |T_j| ≥ 1.
Since the visual attributes of the text can only be modified by the HTML tags,
each text string s_i has constant values of all the attributes. Let's define the notion of
text element as a text string with visual attributes. Each text element e_i ∈ E, where
E = S × I × I × I. As usual, we write e_i as

    e_i = (s_i, v_i, x_i, w_i)                                          (5.4)

and we define

    e_i < e_j;  1 ≤ i ≤ n - 1,  i + 1 ≤ j ≤ n                           (5.5)

where s_i ∈ S and v_i, x_i, w_i ∈ I. s_i is a text string that represents the content of the
element, v_i is the identifier of the visual area the element belongs to, and x_i and w_i
are the element markedness and weight, which present a generalization of the visual
attributes of the text string as defined further.
From this point of view, the whole text of the document with visual attributes can
be expressed as a string of the form

    M_t = e_1 e_2 e_3 ... e_n                                           (5.6)

where e_i = (s_i, v_i, x_i, w_i), 1 ≤ i ≤ n, are the text elements. s_i corresponds to the
appropriate text string in (5.2). v_i is determined while the tree of visual areas is being
built. The element e_1 always contains the document title, as discussed further in section
5.4.
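The text elements e_i = (s_i, v_i, x_i, w_i) of equations (5.4)–(5.6) map naturally onto a small record type. A sketch; the field names are just illustrative renderings of the four components:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TextElement:
    """e_i = (s_i, v_i, x_i, w_i) from equation (5.4)."""
    text: str        # s_i: text string between two tag runs
    area: int        # v_i: identifier of the enclosing visual area
    markedness: int  # x_i: how much the element is highlighted
    weight: int      # w_i: level in the hierarchy of headings (0 = body)

def document_text(elements):
    """Recover the plain document text from M_t = e_1 e_2 ... e_n."""
    return "".join(e.text for e in elements)
```

Concatenating the `text` fields of the sequence M_t reproduces the document text with all tags stripped, which is exactly the string the logical structure tree is built over.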
The Markedness of a Text Element

The visual appearance of a text element e_i can be determined by interpreting the HTML
tags in T_j, j ≤ i, and the corresponding CSS styles (this is basically what web browsers
do). The markedness of an element determines how much the element is highlighted in
the document. In order to determine the element markedness, we analyze and interpret
the following visual attributes of the text element:
- Font size. The font size is used to distinguish important parts of the text, especially
headers. The relation between the font size and the visual markedness
is straightforward: the greater the font size, the more expressive the text
element is. The normal text of a document is usually written in some default font
size. A small font size can be used for writing some additional information (e.g.
copyright notes etc.)
- Font weight. Although more degrees of the font weight can be distinguished, from
the point of view of a reader we can distinguish two of them: normal and bold.
A text element written in bold is always more visually expressive than normal
text with the same attributes.
- Font style. The font style can be normal or italic. It is also possible to use the font
style called oblique or slanted. This style is however very similar to italic and it is
usually not distinguished by the readers, nor even by some web browsers. Writing
a text element in italics is often used to reach a higher visual markedness.
- Text decoration. A text element can be underlined in order to increase its
markedness or, on the contrary, striked out to decrease its markedness. Overlined
elements are usually not used and there is no clear interpretation of overlining.
- Text color. The document text is usually written in one color. Different colors
can however be used for highlighting some parts of the text. According to this,
any text element written in other than the default color has a higher visual markedness.
Based on the above visual properties, we define the following heuristic for computing
the value of markedness of a text element:

    x = (F · Δf + b + o + u + c) · (1 − z)    (5.7)

where b, o, u, c and z have the value 1 when the text element is bold, oblique, un-
derlined, color-highlighted or struck out, respectively, and 0 if it is not. Δf is the
difference f − f_d, where f is the font size of the element and f_d is the default font size
for the document. The constant F defines the relation between the text size and its
markedness. For F > 4, an element with a greater font size is always more important
than an element with a lower font size. This corresponds to the usual interpretation.
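The heuristic (5.7) can be sketched in Python as follows. This is an illustrative sketch only; the function and parameter names are ours, not part of the method's specification.

```python
def markedness(f, fd, b, o, u, c, z, F=5):
    """Markedness x of a text element according to heuristic (5.7).

    f  -- font size of the element, fd -- default font size,
    b, o, u, c -- 1 if the element is bold, oblique (italic),
                  underlined or color-highlighted, else 0,
    z  -- 1 if the element is struck out (cancels any markedness),
    F  -- constant weighting the font-size difference (F > 4).
    """
    df = f - fd                       # the font size difference (delta f)
    return (F * df + b + o + u + c) * (1 - z)

# The title "Sample document" from Table 5.3 (size difference +8, bold):
print(markedness(20, 12, b=1, o=0, u=0, c=0, z=0))  # -> 41
```

With F = 5, the values reproduce the x column of Table 5.3 exactly.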
Element Weight
Some of the text elements are used as headings or labels that describe the contents of
a corresponding section of the document on a higher level of abstraction. The weight
of an element describes the position of the element in the hierarchy of headings. The
main title of the document has the highest weight while normal text has a weight
of zero. For determining the element weight, we analyze the following factors:
- The markedness of the element. Headings are usually written in a larger font
or at least in bold. The more expressive the text element, the higher should be
its weight.
- Element position. The position of the element is critical for deciding whether the
element is a heading or not. An element that lies in a continuous block of text
cannot be considered a heading. For example, the word "weight" in the figure
5.5 cannot be considered a heading even though it has a greater markedness than
the surrounding text. Generally, we can say that the element must be placed at
least at the beginning of a line. An element e_n is placed at the beginning of a
line iff any of the tags in T_n causes a line break. Such tags are defined in [54].
- Punctuation. The punctuation can be used for denoting the term-description or
property-value pairs of elements. An element that contains text ending with
a colon should have a higher weight than the same element that doesn't end with
a colon.
Considering these factors, we can define the weight of a text element based on a
heuristic similar to the definition of markedness:

    w = [(F · Δf) + (b + o + u + c) · l + W · p] · (1 − z)    (5.8)

where F, Δf, b, o, u, c and z have the same meaning as in the markedness definition
(5.7), l and p have the value 1 when the element follows a line break and when the
element text ends with a colon, respectively, and W is the weight of the final colon.
An element that ends with a colon should have a higher weight than a possibly
following element written in bold, underlined or highlighted by color. This condition
holds for W > 4.
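The weight heuristic (5.8) can be sketched the same way. This is a sketch under one reading of the formula, namely w = [(F·Δf) + (b+o+u+c)·l + W·p]·(1−z); the names are ours.

```python
def weight(f, fd, b, o, u, c, z, l, p, F=5, W=5):
    """Weight w of a text element according to heuristic (5.8).

    l -- 1 if the element follows a line break, else 0,
    p -- 1 if the element text ends with a colon, else 0,
    W -- weight of the final colon (W > 4);
    the remaining parameters are as in the markedness heuristic (5.7).
    """
    df = f - fd
    return ((F * df) + (b + o + u + c) * l + W * p) * (1 - z)

# The bold heading "Text" from Table 5.3 (size difference +5, after a break):
print(weight(17, 12, b=1, o=0, u=0, c=0, z=0, l=1, p=0))  # -> 26
```

With F = 5 and W = 5, the values reproduce the w column of Table 5.3.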
5.3 Representing the Hypertext Links
As follows from the hypertext nature of the world wide web, each document can
contain links to other documents. These links don't form part of the text content of
the document but they can be specified using a special HTML tag <a>. Therefore, the
links are not directly included in the visual information model1; they may however be
useful for the discovery of the logical documents that is going to be discussed below in
section 5.7. For this purpose, it is necessary to represent the included hypertext links
in some way.
In HTML, a link is assigned to a continuous portion of the HTML text usually
called an anchor. This text is visually distinguished in the document and it is assigned a
target URI that points to another document. In our document representation, the text
of the document consists of text elements. Thus, we can define a representation of a
link as

    l = (uri, T)    (5.9)

where uri is the target URI of the link and T is a non-empty set of text elements that
form the anchor. Since the anchor must be continuous, T always contains one or more
subsequent text elements. For each document, we can create a set

    L = {l_1, l_2, ..., l_n}    (5.10)

that contains all the links in the document.
5.4 HTML Code Analysis for Creating the Models
The information necessary for creating both components of the visual information
model, the page layout model and the text features model, must be obtained from
the HTML code of the document. The process of obtaining the necessary data can be
split into the following steps:
1. Obtaining the document(s)
2. Tag to CSS conversion
3. Visual area and text element identification
4. Unit unification
5. Computation of the markedness and weight
The way in which the documents are obtained is not significant for the document
modeling. Usually, the documents are retrieved through a network using the HTTP
protocol. For the analysis, we have to retrieve the HTML document itself and all the
eventual style sheets referenced in the document. When frames are used, it is also
necessary to obtain the code of all the frames.

1 Only the visual aspect of the links is considered, as discussed below in section 5.4
As the next step, we go through the HTML code and we convert all the tags
that modify the values of the visual attributes to CSS specifications. For example,
we remove all the <b> and </b> tags from the document code and replace them
with <span style="font-weight: bold"> and </span> respectively. Similarly, some
visual attributes can be influenced by multiple CSS properties or by different
specification methods (absolute or relative value, etc.). In this phase, we convert the
specification using HTML tags and CSS properties to a single CSS property for each
visual attribute. For each tag encountered, we also check the document-wide style
specification that may be influenced by the id or class attributes of the tag. The
resulting CSS style definitions are inserted back into the document using the style
attribute of the generic <span> and <div> tags. Various methods and units of the
value specification are unified later, in the unit unification phase. Table 5.2 shows a
list of all HTML tags and the corresponding CSS properties that influence the values
of individual visual attributes of the text. We convert the specifications as follows:
- Headings are converted to a text style with the font-weight: property set to
bold and the font-size: property set according to the level of the heading.
According to the standard behaviour of web browsers, the headings of level
1 to 6 have the font size set to xx-large, x-large, large, medium, small and
x-small respectively.
- Table headers specified using the <th> tag have the font-weight: property set
to bold.
- Links, i.e. the text that is used as a hypertext link to other documents, have
the text-decoration: property set to underline.
- Font specifications using the <font> tag can be used for modifying the font face,
size and color for a part of the text. The font size definition is converted to the
font-size: property specification in CSS and the color attribute is converted to
the color: property. We don't deal with the font face in our method.
Attribute    CSS properties       HTML tags
Size         font-size:,          <big>, <font>, <h1> - <h6>,
             line-height:         <small>
Weight       font-weight:         <b>, <h1> - <h6>, <th>,
                                  <strong>
Style        font-style:          <em>, <i>
Decoration   text-decoration:     <s>, <strike>, <u>
Color        color:               <font>

Table 5.2: Text Visual Attributes

- Bold text specified using the <b> and <strong> tags is converted to the
font-weight: bold specification.
- Italics specified using the <i> and <em> tags are converted to font-style:
italic.
- Underlined and struck-out text is specified using the text-decoration: property
set to underline or line-through respectively.
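The conversion rules above can be sketched as a simple substitution. This is a minimal illustration covering only a subset of Table 5.2; a real implementation must also handle tag attributes, nesting and the document-wide styles, and the function name is ours.

```python
import re

# A subset of the HTML tag -> CSS property mapping from Table 5.2
TAG_TO_CSS = {
    "b": "font-weight: bold", "strong": "font-weight: bold",
    "i": "font-style: italic", "em": "font-style: italic",
    "u": "text-decoration: underline",
    "s": "text-decoration: line-through",
    "strike": "text-decoration: line-through",
}

def normalize_tags(html):
    """Replace visual HTML tags with generic <span> elements styled by CSS."""
    def repl(m):
        closing, tag = m.group(1), m.group(2).lower()
        if tag not in TAG_TO_CSS:
            return m.group(0)          # leave all other tags untouched
        return "</span>" if closing else '<span style="%s">' % TAG_TO_CSS[tag]
    return re.sub(r"<(/?)(\w+)>", repl, html)

print(normalize_tags("a <b>bold</b> word"))
# -> a <span style="font-weight: bold">bold</span> word
```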
The visual properties of some page objects such as headings and table headers are not
exactly defined in the HTML or CSS specification. The resulting visual appearance of
these elements depends to a certain extent on the used web browser and its configura-
tion. The conversion rules listed above have been observed on the popular web browsers
Mozilla2 version 1.6 and Konqueror3 version 3.2.2. Especially the first one is com-
monly considered to implement all the web standards and recommendations most
precisely. Anyway, since a certain variety in interpreting the HTML code by different
web browsers shouldn't influence the document understanding by the user, the men-
tioned rules present one of the possible variants that is acceptable and a de facto
standard.
The next phase consists of discovering the visual areas and the text elements in the
document. These two tasks are done simultaneously during a single-pass analysis of the
HTML code. The output of this phase is
1. The tree of visual area identifiers M_l that corresponds to the definition (5.1).
2. A string of styled elements M_s = es_1, es_2, ..., es_n.
A styled element is basically a text string with an assigned visual area identifier and
a style. The style can be represented as a tuple

    style = (f_size, f_weight, f_style, f_decoration, color)    (5.11)

where the individual elements correspond to the appropriate CSS properties. Then, a
styled element can be defined as

    es_i = (s_i, v_i, style_i)    (5.12)

where s_i is the text content of the element, v_i is the identifier of the assigned visual
area and style_i is the style of the element.

2 http://www.mozilla.org
3 http://konqueror.kde.org

During the HTML code analysis, we read the HTML tags subsequently from the
code of the document. For creating the tree of visual area identifiers M_l, we maintain
a stack S_v of currently open visual areas. At the beginning, S_v contains the identifier
v_0 of the root area and M_l contains only v_0 as the root element. When a tag is read
that implies the start of a new area (any tag from the table 5.1), a new area is opened
and it is assigned a new identifier v_i. This identifier is added to M_l as a son node of
the identifier currently on the top of S_v and afterwards, v_i is added to the top of S_v.
When the corresponding closing tag is encountered, the area identifier on the top of
the stack is removed from the stack.
For determining the visual attributes of the elements, we maintain a stack S_s of
styles. At the beginning, the stack contains a default style. When any opening tag is
encountered in the code, the top of the stack is duplicated. The style on the top of the
stack is updated according to the available style definitions for the tag being processed.
When a corresponding closing tag is encountered, the top of the stack S_s is removed.
The style of the element can be specified by any CSS specification (see the overview in
section 2.2). This specification is particularly important for the hypertext links, which
are usually underlined by default, but any other visual effect can be specified using
CSS definitions for the <a> tags.
Any non-empty text string encountered between two subsequent tags forms a new
styled element. The first styled element es_1 is always formed by the document title
specified by the <title> tag. If no title is specified, the element contains an empty
string s_1. The weight of the title element should always be greater than the weight of
any other text element in the page. Each new element is assigned the identifier of the
visual area that is currently on the top of the stack of visual areas S_v and the style
that is currently on the top of the stack S_s. Then, the new styled element is appended
to the resulting string of styled elements M_s.
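The two-stack, single-pass analysis described above can be sketched using the standard html.parser module. This is a simplified illustration: only a few area tags and a single style property are handled, and the class name is ours.

```python
from html.parser import HTMLParser

# A small subset of the tags that open a new visual area (cf. table 5.1)
AREA_TAGS = {"body", "table", "tr", "td", "th", "ul", "ol", "li", "div", "p"}

class VisualAnalyzer(HTMLParser):
    """Builds the tree of visual areas (M_l) and the string of styled
    elements (M_s) in a single pass, using the two stacks S_v and S_s."""

    def __init__(self):
        super().__init__()
        self.next_id = 0
        self.tree = {0: []}                            # M_l: area id -> child ids
        self.areas = [0]                               # S_v: stack of open areas
        self.styles = [{"font-weight": "normal"}]      # S_s: stack of styles
        self.elements = []                             # M_s: styled elements

    def handle_starttag(self, tag, attrs):
        self.styles.append(dict(self.styles[-1]))      # duplicate the top of S_s
        if tag == "b":                                 # update it for this tag
            self.styles[-1]["font-weight"] = "bold"
        if tag in AREA_TAGS:                           # open a new visual area
            self.next_id += 1
            self.tree.setdefault(self.areas[-1], []).append(self.next_id)
            self.areas.append(self.next_id)

    def handle_endtag(self, tag):
        if len(self.styles) > 1:                       # remove the top of S_s
            self.styles.pop()
        if tag in AREA_TAGS and len(self.areas) > 1:   # close the visual area
            self.areas.pop()

    def handle_data(self, data):
        if data.strip():            # a non-empty string forms a styled element
            self.elements.append((data.strip(), self.areas[-1], self.styles[-1]))

p = VisualAnalyzer()
p.feed("<body><p>plain <b>bold</b></p></body>")
print(p.elements)
```

Both text strings are assigned the identifier of the innermost area (the paragraph), and only the second one carries the bold style.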

In the unit unification phase, the values of the size and the color of the style of all the
styled elements from M_s are converted to a unified form. In CSS, the font size can be
specified by an absolute value, relatively, or by pre-defined keywords (e.g. small or
x-large), and various units can be used. The text color can be specified by pre-defined
keywords or by the percentage of red, green and blue color that can be written in
various ways. During the unification, all sizes are converted to an absolute value in
points (pt) and the colors are converted to the string #rrggbb where rr, gg and bb mean
the value of the red, green and blue components in hexadecimal from 00 to FF.
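The unification step can be sketched as follows. The keyword tables are abbreviated and the concrete point values assigned to the keywords are our assumption, not part of the CSS specification; the function names are ours.

```python
# Abbreviated keyword tables (assumed point values, not normative)
SIZE_KEYWORDS = {"x-small": 8, "small": 10, "medium": 12,
                 "large": 14, "x-large": 18, "xx-large": 24}
COLOR_KEYWORDS = {"black": "#000000", "white": "#ffffff", "red": "#ff0000"}

def unify_size(value, default_pt=12):
    """Convert a CSS font-size value to an absolute value in points."""
    if value in SIZE_KEYWORDS:
        return SIZE_KEYWORDS[value]
    if value.endswith("pt"):
        return float(value[:-2])
    if value.endswith("px"):
        return float(value[:-2]) * 0.75        # 1px = 0.75pt at 96 dpi
    if value.endswith("em"):
        return float(value[:-2]) * default_pt  # relative to the default size
    return default_pt

def unify_color(value):
    """Convert a CSS color value to the unified #rrggbb form."""
    if value in COLOR_KEYWORDS:
        return COLOR_KEYWORDS[value]
    if value.startswith("rgb("):               # e.g. rgb(255, 0, 0)
        r, g, b = (int(x) for x in value[4:-1].split(","))
        return "#%02x%02x%02x" % (r, g, b)
    return value.lower()                       # assumed to be #rrggbb already

print(unify_size("x-large"), unify_color("rgb(255, 0, 0)"))  # -> 18 #ff0000
```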
In the final phase, we create the model of text visual features M_t as defined in (5.6)
from M_s by computing the values of markedness and weight for all the styled elements.
For each styled element es_i ∈ M_s, we create a new text element e_i by copying the
values of s_i and v_i and computing the value of the markedness x_i according to the
definition (5.7) and the weight w_i according to the definition (5.8). Then, the new
element e_i is appended to M_t.
5.4.1 Tables in HTML

In HTML documents, tables can be used in two basic roles:
- As a standard instrument for presenting structured, tabular data
- For achieving a desired page layout
The use of tables for achieving some layout is deprecated by the HTML spec-
ification [54]; cascading style sheets should be used instead. Using tables for this
purpose causes many problems such as slower page rendering and display problems on
non-visual media. However, it is very frequent on the current world wide web to use
tables this way. For the HTML code analysis, this use of tables corresponds to the
analysis method presented above. Each table cell is interpreted as a separate visual
area, as well as the whole table itself. The text content is processed independently of
the table or cell tags; only the visual area identifier is assigned to each text element.
Structured tables that are used for presenting some tabular data must be handled
as a special case in the HTML code analysis. As in an HTML document it always
holds that the order of the text elements in the document code corresponds to the
order of the elements in the displayed document, tables introduce a two-dimensional
structure where the logical order of the elements depends on the organization of the
table and it can be interpreted in various ways. Moreover, tables can contain a
hierarchy that results from the relation between the table header and the rest of the
table [62].
[Figure: a table with the header cell "Personal data" above the rows Name: John Smith
and E-mail: john@johnsmith.cz, together with the hierarchy a reader derives from it]

Figure 5.6: An example of a structured table


An example of such a table is in the figure 5.6. The contents of table cells in
HTML are always defined row by row, so the order of the elements in the HTML code
is "Personal Data", "Name", "E-mail", etc. However, a reader interprets such a table
as the hierarchy shown in the right part of the figure. In this case, the hierarchy of
visual areas that correspond to the table cells is not created by the HTML code only,
but the relations among the header and non-header cells must also be considered. Let's
assign numbers from 1 to 5 to the visual areas that correspond to the table cells as
shown in the figure 5.7 and let's consider that the whole table forms a visual area 0.
Then, using the HTML processing method specified above, the hierarchy of the visual
areas corresponds to the left tree in the figure 5.7. For the reader, however, the
hierarchy of the visual areas appears the way corresponding to the right tree in the
figure 5.7. For this reason, a special algorithm must be used for processing the
structured tables.
When a table is encountered in the HTML code, the first step is to decide which
are the header cells. HTML allows to distinguish between the header cells and the data
[Figure: two trees over the numbered table cells; left: area 0 with the children 1, 2, 3,
4 and 5; right: area 0 with the child 1 (the header), whose children 2 and 3 have the
children 4 and 5 respectively]

Figure 5.7: Visual areas in a structured table

cells using the <th> or <thead> tags; this feature is however not commonly used and
in some cases it is used incorrectly. Therefore, we propose determining the headers by
comparing the visual appearance of the table cells:
- We assume that the header is formed by the top row or the leftmost column of
the table.
- All the cells in the header must have a consistent visual style: font, color and
background.
- When a header row or column is encountered, we add each header cell to the tree
of visual areas as a child node of the table visual area. After this, we consider
each part of the table that is covered by a single header cell as a separate
subtable that is formed by one or more rows (when the header cells are in the
first column) or columns (headers in the first row) and we repeat the header
identification process recursively on all the subtables. The process ends when no
headers can be identified in the subtables. In this case, we assume that the
remaining cells only contain the data values.
At the end of this process we obtain a hierarchy of headers. In the next step, we add
all the cells of the remaining subtables to the visual model as the child nodes of the
respective header cell they belong to. An illustration of this process is in the figure 5.8.
In steps 1 and 2 we detect the headers of the (sub)tables. In step 3, no more headers
can be detected and the data cells are just included in the tree.
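The recursive header detection can be sketched as follows. This is a simplified illustration only: cells are (text, style) pairs, colspan/rowspan are ignored, and a header is recognized purely by a consistent style that differs from that of the remaining cells; the function names are ours.

```python
def consistent(styles):
    """All cells share one visual style (font, color, background)."""
    return len(set(styles)) == 1

def build_hierarchy(grid):
    """Recursive header detection over a grid (list of rows of (text, style)
    cells); returns a nested dict mapping header texts to sub-hierarchies."""
    if not grid or not grid[0]:
        return {}
    top_styles = [style for _, style in grid[0]]
    below = [style for row in grid[1:] for _, style in row]
    if len(grid) > 1 and consistent(top_styles) and top_styles[0] not in below:
        # The top row is a header: the column under each header cell
        # forms a separate subtable, processed recursively.
        return {grid[0][j][0]: build_hierarchy([[row[j]] for row in grid[1:]])
                for j in range(len(grid[0]))}
    left_styles = [row[0][1] for row in grid]
    right = [style for row in grid for _, style in row[1:]]
    if len(grid[0]) > 1 and consistent(left_styles) and left_styles[0] not in right:
        # The leftmost column is a header: each row forms a subtable.
        return {row[0][0]: build_hierarchy([row[1:]]) for row in grid}
    # No header found: the remaining cells only contain the data values.
    return {text: {} for row in grid for text, _ in row}

grid = [
    [("Name", "th"), ("John Smith", "td")],
    [("E-mail", "th"), ("john@johnsmith.cz", "td")],
]
print(build_hierarchy(grid))
# -> {'Name': {'John Smith': {}}, 'E-mail': {'john@johnsmith.cz': {}}}
```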
5.4.2 Example of Visual Models

In order to demonstrate the HTML parsing process, let's consider the simple document
shown in the figure 5.9. The HTML tags that influence the visual appearance of the
text elements are shown in the shaded boxes.
The root visual area is formed by the document. The only object that introduces
sub-areas to the document is a simple table. In this table, a structure can be detected.
The resulting tree of visual areas is shown in the figure 5.10.
Table 5.3 shows the visual attributes of all the text elements in the page and the
values of markedness and weight computed from the definitions (5.7) and (5.8) for
F = 5 and W = 5. As stated in the section 5.2.2, the first element e_1 is always the
title of the document specified using the <title> tag.
[Figure: three steps of the table structure detection; step 1 detects the top-level
header cell, step 2 detects the headers of the subtables, and in step 3 no more headers
can be detected and the remaining data cells are attached to their headers]

Figure 5.8: Example of the table structure detection


The resulting model of text visual attributes is the string of tuples (s, v, x, w) of the
corresponding values in each line of the table. Note that the visual area identifier v of
each text element corresponds to the visual areas in the figure 5.10.
5.5 Logical Structure of a Document
The next step of the proposed information extraction method consists of creating the
model of logical document structure based on the visual information gathered and
modeled in the previous steps. As defined in the introduction, the logical document
structure is a hierarchy of the text elements in a document as it is interpreted by a
reader. This hierarchy should respect the logical relations among the text elements as
they follow from the visual appearance of the text elements, independently of the
particular HTML code. As an illustration, figure 5.11 shows the logical structure of
our sample document from the previous section.
Formally, the logical document structure is an ordered tree of text elements. It can
be denoted as a graph

    S = (V_S, E_S)    (5.13)

where V_S = {e_1, e_2, ..., e_n} is an ordered set of all the text elements in the document,
where e_{i-1} precedes e_i for all 1 < i ≤ n. The set of edges E_S of the tree contains
the tuples (e_i, e_j) such that e_i ∈ V_S and e_j ∈ V_S and e_i is directly superior to
e_j according to the interpretation of their visual attributes and their position in the
document. In order to ensure that S is a tree, it must hold that each text element
except e_1, which is the page title, has exactly one directly superior element (its parent
element in the tree). The element e_1 always forms the root node of the tree.
Figure 5.9: Sample document

The set of edges E_S of the tree is derived from the page layout model M_l (5.1) and
the model of typographical attributes of the text M_t (5.6) by an algorithm that
consists of the following two phases:

1. Creating the set of graph edges E_V such that S_V = (V_S, E_V) is a tree of text
elements and for any e_i, e_j ∈ V_S, e_i is an ancestor of e_j iff the corresponding
visual area identifier v_i is an ancestor of v_j in the tree of visual areas M_l. We
call S_V a frame of the logical document structure.
2. Creating E_S by copying all the edges (e_i, e_j) from E_V and replacing e_i by one
of its descendants if needed, so that the element of a higher weight is always an
ancestor of all the elements with a lower weight within the visual area.

The algorithms for both steps follow.

Figure 5.10: Visual areas in the sample document

    s                        Δf  b  o  u  c  z  l  p  v   x   w
    Structure                 0  0  0  0  0  0  0  0  0   0   0
    Sample document           8  1  0  0  0  0  1  0  0  41  41
    This is the beginning.    0  0  0  0  0  0  1  0  0   0   0
    Text                      5  1  0  0  0  0  1  0  0  26  26
    Various text follows:     0  0  0  0  0  0  1  1  0   0   5
    bold                      0  1  0  0  0  0  0  0  0   1   0
    italics                   0  0  1  0  0  0  0  0  0   1   0
    and underlined            0  0  1  1  0  0  0  0  0   2   0
    There can also be         0  0  0  0  0  0  1  0  0   0   0
    some link                 0  1  0  0  0  0  0  0  0   1   0
    .                         0  0  0  0  0  0  0  0  0   0   0
    Table                     5  1  0  0  0  0  1  0  0  26  26
    Some tabular data:        0  0  0  0  0  0  1  1  0   0   5
    Name                      0  1  0  0  0  0  1  0  4   1   1
    John Smith                0  0  0  0  0  0  1  0  6   0   0
    E-mail                    0  1  0  0  0  0  1  0  7   1   1
    john@johnsmith.com        0  0  0  1  0  0  1  0  9   1   1

Table 5.3: Text elements in the sample document
Algorithm 1 Creating a frame of the logical structure
Input:  V_S = {e_1, e_2, ..., e_n} -- the set of text elements
        M_l -- page layout model
Output: S_V = (V_S, E_V) -- a frame of the logical structure

current = e_1
For each e_i = e_2, e_3, ..., e_n; e_i = (s_i, v_i, x_i, w_i) do
    if v_i ≠ v_{i-1} then
        if v_i is a descendant of v_{i-1} then
            current = e_{i-1}
        else
            current = the nearest e_j such that
                e_j is an ancestor of e_{i-1} in S_V and
                v_j is an ancestor of v_i in M_l
    Add (current, e_i) to E_V
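Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch, not the thesis implementation: elements are (s, v) tuples indexed from 0 and both trees are represented by parent maps; the names are ours.

```python
def is_ancestor(parent_of, a, b):
    """True iff node a is a (strict) ancestor of node b in the tree
    given by the child -> parent map parent_of."""
    while b in parent_of:
        b = parent_of[b]
        if b == a:
            return True
    return False

def build_frame(elements, area_parent):
    """Algorithm 1: elements is the ordered list of text elements (s, v, ...);
    area_parent maps a visual area id to its parent in M_l (the root area 0
    has no entry). Returns E_V as a dict: element index -> parent index."""
    edges = {}                 # E_V, as child -> parent
    current = 0                # e_1 (index 0)
    for i in range(1, len(elements)):
        v_i, v_prev = elements[i][1], elements[i - 1][1]
        if v_i != v_prev:
            if is_ancestor(area_parent, v_prev, v_i):
                current = i - 1
            else:
                # climb the S_V ancestors of e_{i-1} until one lies in an
                # area that is an ancestor of v_i (the root always does)
                j = edges.get(i - 1, 0)
                while j in edges and not is_ancestor(area_parent,
                                                     elements[j][1], v_i):
                    j = edges[j]
                current = j
        edges[i] = current
    return edges

# Area tree: 0 -> 1 -> 2; the last element returns to area 1
area_parent = {1: 0, 2: 1}
elements = [("Doc", 0), ("Section", 1), ("detail", 2), ("back", 1)]
print(build_frame(elements, area_parent))  # -> {1: 0, 2: 1, 3: 0}
```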

Algorithm 2 Applying the element weight
Input:  S_V = (V_S, E_V) -- a frame of the logical structure where
        e_p ∈ V_S is a root of S_V,
        e_c1, e_c2, ..., e_cn ∈ V_S are the child nodes of e_p and
        e_c1 < e_c2 < ... < e_cn
Output: S = (V_S, E_S) -- the logical structure tree

if n > 0 then
    current = e_p
    Add (e_p, e_c1) to E_S
    For each e_ci = e_c2, e_c3, ..., e_cn; e_ci = (s_i, v_i, x_i, w_i) do
        Recursively apply algorithm 2 for e_p = e_ci
        if w_i < w_{i-1} then
            current = e_c(i-1)
        if w_i > w_{i-1} then
            current = the nearest e_cj such that
                e_cj is an ancestor of e_c(i-1) and
                e_p is an ancestor of e_cj and
                w_j > w_i
            when such e_cj doesn't exist then current = e_p
        Add (current, e_ci) to E_S

Figure 5.11: Sample document with the corresponding logical structure
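Algorithm 2 can be sketched as follows. This is a simplified sketch whose bookkeeping differs slightly from the pseudo-code (all child subtrees are processed first, then the siblings are reparented by weight); the names are ours.

```python
def apply_weight(children, weights, parent):
    """Reparent the ordered child list of `parent` so that an element of a
    higher weight becomes an ancestor of the following lower-weight elements.
    `children` maps a node to its ordered child list; mutated in place."""
    kids = children.get(parent, [])
    if not kids:
        return
    for k in kids:                     # process the subtrees first
        apply_weight(children, weights, k)
    new_parent = {kids[0]: parent}
    current = parent
    for prev, node in zip(kids, kids[1:]):
        if weights[node] < weights[prev]:
            current = prev             # attach below the heavier sibling
        elif weights[node] > weights[prev]:
            # climb to the nearest ancestor of prev with a higher weight;
            # if none exists below the parent, attach to the parent itself
            current = new_parent[prev]
            while current != parent and weights[current] <= weights[node]:
                current = new_parent[current]
        new_parent[node] = current
    # rebuild the child lists from the computed parents
    children[parent] = [k for k in kids if new_parent[k] == parent]
    for k in kids:
        if new_parent[k] != parent:
            children.setdefault(new_parent[k], []).append(k)

# Two headings with their following text become two nested sections:
children = {"root": ["h1", "t1", "h2", "t2"]}
weights = {"root": 9, "h1": 5, "t1": 0, "h2": 5, "t2": 0}
apply_weight(children, weights, "root")
print(children)  # -> {'root': ['h1', 'h2'], 'h1': ['t1'], 'h2': ['t2']}
```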

The resulting model of the logical structure exhibits an important feature that can
be described as follows. Let text be the whole text of an HTML document that can be
obtained by taking the contents of the <body> tag of the document and discarding all
the embedded HTML tags. Let e_1, e_2, ..., e_n be the string of text elements obtained
by a pre-order traversal of the resulting tree S and let s_i be the text string contained
in the text element e_i. Then, it holds that

    text = ∏_{i=2}^{n} s_i    (5.14)

In other words, when concatenating all the text strings except the first one (the title)
in the logical document structure during the pre-order traversal of the tree, we obtain
the complete text of the document.
5.6 Information Extraction from the Logical Model
As defined in the previous section, the logical structure of a document is a tree whose
nodes are formed by all the text elements contained in the document. When using the
logical structure model for information extraction, the task is to locate the nodes of
the tree that contain the desired information. We assume that the document is
designed to be understandable by a user and we analyze the way in which the visual
information is used by the user when looking for particular data in the document.
The first assumption is that the user usually knows an approximate form of the
information he is looking for. For example, when looking for a price, we are trying to
find a number preceded or followed by a currency code. The second assumption is
that there is a hierarchy of headings and labels provided by the author of the document
that allows the user to interpret the document contents effectively. The user is thus
provided with three types of navigational cues:
1. The relations among the text elements in the document that correspond to the
logical document structure and that are expressed using the visual design of the
document.
2. The labels provided by the author of the document.
3. The expected format of the desired information.
When looking for particular information, the user typically navigates through
the logical structure of the document as suggested by the visual means and tries to
choose the path leading to the desired information by classifying the headings and
labels that in our case correspond to the nodes of the structure tree. A normal user
can use natural language understanding for choosing the best path. Since natural
language processing is a non-trivial task, we propose a simplified solution based on
regular expressions and approximate tree matching algorithms.
Let's return back to the original idea of information extraction from data-intensive
documents that is based on the extraction of complete data records rather than
on extracting single data values. Further, we will assume that each data value,
when it is present in the document, forms exactly one text element. This assumption
is somewhat simplifying because it doesn't allow the situations when the information
is split into several text elements or, on the contrary, when there is additional text
present in the same text element. However, as follows from the text element definition
(5.4) on page 38, this assumption corresponds to the following situation:
- The data value is visually consistent (thus it forms a single text element)
- The data value is visually separated from the surrounding text by some HTML
tag so that the surrounding text doesn't form part of the text element
In data-intensive documents, the acceptance of these rules can be expected. The
impact of this simplification on the performance of the information extraction method
is discussed in the evaluation part of this thesis.
Let's assume that each data record r to be extracted consists of |r| data values (an
example of such a data record is given on page 3). When we admit that some
data values can be missing in the document, then when extracting the data record,
the task is to locate m text elements in the logical document structure that correspond
to the appropriate data values; N ≤ m ≤ |r|, where N is the minimal number of
values that must be located for considering the data record to be found. All the
located text elements that form a single data record are located in a certain subtree of
the logical structure tree. Aside from the data values, this subtree contains the text
elements that correspond to the labels that denote the values in the document and
some remaining text elements that can contain additional notes etc.
From this perspective, the information extraction task can be viewed as a task of
locating the subtrees of the logical structure tree that meet the following requirements:
- The subtree contains the data values of the expected form and the expected
labels.
- The data values are logically related to the appropriate labels; i.e. the text
element containing a potential data value is a descendant of an element
containing the appropriate label.
For locating the appropriate subtrees of the logical structure tree, we propose an
approach based on tree matching algorithms.
5.6.1 Using Tree Matching

The proposed approach is based on the specification of a template of the subtree that
is to be located. This specification consists of two steps:
1. We specify the expected logical structure of the extracted data record
2. We add the information about the expected labels and the format of the data
values. Regular expressions are used for this specification.
The result of this specification is a template of a tree that can be interpreted as a
structured query to the database of all the subtrees of the logical structure tree.
Let's consider a simple example in the Figure 5.12. From a personal page, we
want to extract the name, department and the e-mail address of that person. The
left tree defines the expected logical structure of the extracted information. In case
of the personal pages, the name of the person is usually presented as a superior text
(heading) and the remaining data is placed further in the document. We extend the
tree by replacing the fields with the regular expressions that denote their expected
format
[Figure: left tree: Name with the children Department and E-mail; right tree: the
query with the root ^[a-zA-Z\ \.]+$ and the children [Dd]epartment (with the child
^[a-zA-Z\ \.]+$) and [Ee]-?mail (with the child ^[A-Za-z0-9_\.]+@[A-Za-z0-9_\.]+$)]

Figure 5.12: An example extraction query with regular expressions

and we add the expressions that denote the expected labels of these data. The resulting
tree can be viewed as a structured query and tree matching algorithms can be used
for identifying all matching subtrees of the logical document structure.
For simple extraction tasks, we can simply estimate the logical structure of the
data records as shown in the previous example. For more complex tasks, it is
necessary to obtain at least one sample document in order to determine the logical
structure of the contained data records. Furthermore, there usually exist several
variants of structuring the data records. For example, the previous example shown in
the figure 5.12 could be changed so that the name is presented at the same level as the
remaining data. For this reason, it may be necessary to modify the expected data
structure when processing a new set of documents. By adapting the extraction task
for further sets of documents, we obtain several variants of the template subtree that
cover all possible variants of presenting the extracted data. However, in comparison
to wrappers, where the number of possible variants is extremely high and therefore
the wrappers must be re-generated for each set of documents, in our case, there are
only a few variants of the logical structure and with the increasing number of
processed data sets, the frequency of the modifications decreases rapidly. For example,
for the sample task shown in the figure 5.12, two variants of the task specification are
sufficient, as shown in the method evaluation part in section 7.1.1.
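The matching of a query tree with regular expressions against a subtree of the logical structure can be illustrated by a simple exact (non-approximate) variant. This is a sketch only: trees are (text, children) tuples, unordered child matching is done greedily, and the example data is invented for illustration.

```python
import re

def matches(query, node):
    """True iff `node` (a tuple (text, children)) matches the query tree
    (a tuple (regex, children)): the node text must match the regex and
    every query child must match some distinct child of the node."""
    pattern, q_children = query
    text, n_children = node
    if not re.search(pattern, text):
        return False
    used = set()
    for q in q_children:
        for k, child in enumerate(n_children):
            if k not in used and matches(q, child):
                used.add(k)            # each node child matched at most once
                break
        else:
            return False               # no node child matched this query child
    return True

# A hypothetical personal-page subtree and the query of Figure 5.12:
person = ("John Smith", [
    ("Department of Information Systems", []),
    ("E-mail: john@fit.vutbr.cz", []),
])
query = (r"^[a-zA-Z\ \.]+$",
         [(r"[Dd]epartment", []), (r"[Ee]-?mail", [])])
print(matches(query, person))  # -> True
```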
5.6.2 Approximate Unordered Tree Matching Algorithm

For tree matching, we use a modification of the pathfix algorithm published by Shasha
et al. [56]. This algorithm solves approximate searching in unordered trees based on
root-to-leaf path matching. The original problem is to find all data trees D from
a database 𝒟 that contain a substructure D' which approximately matches a query
tree Q within distance DIFF. The distance measure is defined as the total number of
root-to-leaf paths from Q that do not appear in D'. Let's consider an example of the
query tree Q in Figure 5.13, which has two paths. The query tree matches data tree
D_1 with a distance of 0 (both paths from Q are present in D_1). For the tree D_2,
there are two possible matches with a distance of 1 (the path A-D is not present in the
tree but the path A-B has two occurrences in D_2).
[Figure: the query tree Q (root A with the children B and D; paths A-B, A-D) and two
data trees: D_1 (root C with the children B and A, the A having the children B and D;
paths C-B, C-A-B, C-A-D) and D_2 (root A with the children B and A, the inner A
having the children B and C; paths A-B, A-A-B, A-A-C)]

Figure 5.13: Pathfix algorithm illustration

In order to use this algorithm for our purpose, we introduced the following
modifications:
• Instead of the database of trees we have only one tree (the logical structure tree)
and we want to locate the subtrees (data records) that match the query tree. Thus,
in our case, the database is formed by all the subtrees of the logical document
structure S whose root node matches the root node of the query tree.
• The values in the nodes of the trees don't have to match exactly, since the query
tree Q contains regular expressions in its nodes. We define that a node in the
logical structure tree S matches a node in the query tree Q iff the appropriate
text element matches the regular expression in the query tree node.
• Since the exact form of the logical document structure may differ to some extent
among the web sites, we have extended the path matching algorithm by intro-
ducing two more parameters:
  – QSKIP – the number of nodes from the query path that don't match any node
  from the matched path in S. This corresponds to the situation that some
  labels or data values expected by the tree template are missing in the document.
  – MSKIP – the number of nodes in the matched path from S that don't match
  any node in the query path. This corresponds to the situation that there
  are some extra nodes in the logical structure tree that are not expected by
  the tree template.
These parameters allow us to consider the paths as matching even if there is
a certain number of missing or excessive nodes.
The accuracy of the results depends on the values of the QSKIP and MSKIP
parameters and on the allowed number of missing paths DIFF. A more detailed analysis
of the influence of these parameters on the precision and recall of the method is given in
Chapter 7.
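To make the extended matching concrete, the following sketch shows the approximate matching of a single root-to-leaf query path against a structure path under the QSKIP and MSKIP budgets. This is an illustrative Python sketch with hypothetical names; the thesis system implements the matching in Java as part of the extraction module.

```python
import re

def path_matches(query_path, struct_path, qskip, mskip):
    """Match a root-to-leaf query path (a list of regular expressions)
    against a structure path (a list of text values), allowing up to
    `qskip` query nodes and `mskip` structure nodes to be skipped."""
    def go(qi, si, q_left, m_left):
        if qi == len(query_path):
            # remaining structure nodes must fit in the MSKIP budget
            return len(struct_path) - si <= m_left
        if si == len(struct_path):
            # remaining query nodes must fit in the QSKIP budget
            return len(query_path) - qi <= q_left
        if re.search(query_path[qi], struct_path[si]):
            if go(qi + 1, si + 1, q_left, m_left):
                return True
        # skip a query node (an expected label or value is missing)
        if q_left > 0 and go(qi + 1, si, q_left - 1, m_left):
            return True
        # skip a structure node (an extra node not expected by the template)
        if m_left > 0 and go(qi, si + 1, q_left, m_left - 1):
            return True
        return False
    return go(0, 0, qskip, mskip)
```

Each structure node is tested against the regular expression of the corresponding query node, as required by the second modification above.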
5.7 Information Extraction from Logical Documents
Analogous to the physical documents, the logical documents also exhibit a hierarchi-
cal organization for similar reasons (see the discussion on page 25, section 3.3.3). In
contrast to the physical documents, hypertext allows giving the logical document
any organization that can be modeled using a directed graph. On the other hand, the
organization must be easily understandable to the document users. From the practical
point of view, considering this requirement, it is not advisable to use a more complex
organization than a hierarchical one. As observed in [60], when we extract only
the links intended to be the routes through which the readers go forward within a
document, in most cases we obtain a sequence or a hierarchy of pages.
Let's represent the logical document as a tuple

D = (Dp, I)    (5.15)

where Dp is a set

Dp = {d1, d2, ..., dn}    (5.16)

where di, 1 ≤ i ≤ n, are the physical documents and I ∈ Dp is the main (index) page. In
the context of using the information extraction method described in the above chapters,
the physical document di can be defined as a tuple

di = (urii, Si, Li)    (5.17)

where urii is the uniform resource identifier of the physical document, Si is the model
of the logical structure as defined on page 46 and Li is the set of links defined on
page 40.
The extension of our information extraction method from physical to logical doc-
uments is straightforward. We assume that the user specifies the URI of a document
when invoking the information extraction process. For discovering the logical docu-
ments, we adopt the method published in [60], which is also discussed in section
3.3.2. The information extraction process consists of the following steps:
• We consider the document whose URI has been specified by the user to be the index
page I of the logical document. We use the method of [60] for obtaining the URIs
of the remaining physical documents di ∈ Dp.
• We create the models Si of the logical structure for each physical document di
separately using our method proposed in the above sections. Simultaneously, we
create the set of links Li for each di.
• We create a single model S of the logical structure for the whole logical document
D by joining the logical structure trees Si of the physical documents into a single
tree.
When joining the logical structure trees Si into the global model of the logical struc-
ture S, we proceed in the following way:
• Let the root node of I (the logical structure tree of the index page) be the root
node of S.
• Let's assume that the index page I corresponds to a physical document dj =
(urij, Sj, Lj). For each link l = (url, T), l ∈ Lj, we select the text element e ∈ T
that precedes all the remaining elements contained in T (see the definition (5.4) on
page 38). If the url of the link corresponds to the URI of any document
dk ∈ Dp, then we add the logical structure tree Sk to S as a subtree of the
selected element e.
• We repeat the process recursively for each added subtree.

After finishing this process, we have created the model of the logical structure for
the whole logical document. This model can be used for information extraction as
specified in section 5.6. We can see that this approach abstracts from the physical
organization of the information into documents by creating a unified model for all the
physical documents.
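The joining procedure can be sketched as follows. The data structures and names are illustrative assumptions, not the actual system API; a visited set is added because real link structures need not be acyclic.

```python
class Node:
    """A node of a logical structure tree (a text element with children)."""
    def __init__(self, content, children=None):
        self.content = content
        self.children = children or []

def join_trees(index_uri, trees, links):
    """Join per-page logical structure trees into one global tree.
    `trees` maps each page URI to its structure tree root; `links` maps
    a URI to (target_uri, anchor_node) pairs, where anchor_node is the
    first text element of the link."""
    visited = set()
    def attach(uri):
        if uri in visited:          # guard against link cycles
            return
        visited.add(uri)
        for target, anchor in links.get(uri, []):
            if target in trees:
                # add the target page's tree as a subtree of the anchor
                anchor.children.append(trees[target])
                attach(target)      # repeat recursively for each subtree
    attach(index_uri)
    return trees[index_uri]
```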
Chapter 6

Experimental System Implementation

In order to evaluate the proposed information extraction method in the real World
Wide Web environment, we have implemented an experimental system for extracting
information from data-intensive documents on the web.
6.1 System Architecture Overview
The general architecture of our information extraction system is shown in figure 6.1.
The system consists of four basic modules:
• Interface module implements an interface between the system and the Internet.
It is responsible for downloading the necessary HTML documents and storing them
for further analysis.
• Logical document module implements the discovery of logical documents. Given
a URI of the main page, it analyzes the hypertext links and creates a list of URIs
of the documents that form the logical document.
• Analysis module provides the analysis of the HTML code in order to create the
model of the visual information. Further, it implements the transformation of the
visual information model into the logical structure model. Finally, it provides the uni-
fication of the logical structure models into the unified tree of the logical structure.
• Extraction module implements the information extraction from the logical struc-
ture model using the tree matching algorithms. The input is the user-specified
extraction template and the output is the extracted data.
In order to reach maximal flexibility of the system and in order to be able to evalu-
ate the individual parts separately, the modules are implemented as standalone pieces
of software that communicate with each other. The communication with the interface
module is simple – the module works as an HTTP proxy, so the communication
is implemented using the HTTP protocol. The communication among the remaining
Figure 6.1: System architecture overview. The starting URI enters the logical document
module, which performs the logical document discovery and passes a URI list (XML) to
the analysis module. The interface module works as an HTTP proxy between the Internet
and the other modules and keeps an HTML document repository. The analysis module
(HTML parser, visual information analyzer and logical structure analyzer) produces the
logical structure model (XML), which the extraction module matches against the
extraction template in order to produce the extracted data.
<?xml version="1.0" encoding="iso-8859-2"?>
<!DOCTYPE logical_document SYSTEM "ld.dtd">
<logical_document url="http://www.xyz.cz/">
<document url="http://www.xyz.cz/index.php" />
<document url="http://www.xyz.cz/aboutus.php" />
<document url="http://www.xyz.cz/contact.php" />
</logical_document>

Figure 6.2: XML representation of a logical document

modules is based on the use of XML. This solution allows achieving maximal inter-
operability of the modules – each module can be used separately and the produced
output can be used for any task. This is particularly useful for representing the logical
document structure, which can be used not only for information extraction but also for
other tasks where the existing XML processing technologies such as XSLT or XQuery
can be used. For this reason, we now describe the use of XML in our system in more
detail.
6.2 Using XML for Module Communication
The XML-based communication is used at two points of the system. First, XML is used
for transferring URI lists from the logical document module to the analysis module. Sec-
ondly, the most important use of XML is the representation of the global model of the
logical structure. Furthermore, XML is also used for the extraction task specification.
This application of XML is discussed below in section 6.3.4. Formal definitions of
the used XML formats are given in Appendix B.
6.2.1 Representing the Logical Documents

The data to be represented is formed by a simple list of URIs of the physical documents
that form the logical document. We represent each physical document by a <document>
tag with an attribute url that contains the URI of the physical document. The root
tag <logical_document> encloses the list of physical documents. An example XML
representation of a logical document is shown in figure 6.2. The full DTD (Doc-
ument Type Definition) of the logical document XML representation can be found in
Appendix B.2.
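A document list in the format of Figure 6.2 can be produced in a few lines; this is an illustrative sketch using a standard XML library (the actual system is implemented in Java):

```python
import xml.etree.ElementTree as ET

def logical_document_xml(main_uri, page_uris):
    """Serialize a logical document (a list of physical document URIs)
    in the XML format of Figure 6.2."""
    root = ET.Element("logical_document", url=main_uri)
    for uri in page_uris:
        # one <document> element per physical document
        ET.SubElement(root, "document", url=uri)
    return ET.tostring(root, encoding="unicode")
```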
6.2.2 Logical Structure Representation

Since the logical document model is a tree of text elements, the XML representation is
straightforward. Each text element e = (s, v, x, w) is represented by a single XML tag
<text> with attributes that represent the individual values of the tuple: the content
attribute carries the text content s of the element and the attributes visual, expr and
<document title="Sample" url="http://sample.org">
<text content="Personal data" visual="2"
expr="3" weight="3">
<text content="Name" ...>
<text content="John Smith" .../>
</text>
<text content="E-mail" ...>
<text content="john@johnsmith.org" .../>
</text>
</text>
</document>

Figure 6.3: XML representation of the logical structure model

weight contain the values of v, x and w, respectively. Furthermore, a <document> tag
is used as the root element. Figure 6.3 shows an example representation of the logical
structure of the document shown in figure 5.6. The attributes of some elements
have been omitted for greater clarity. The full DTD of the logical structure XML
representation can be found in Appendix B.3.
6.3 Implementation
All the system modules have been implemented on the Java platform. The main reasons
for this choice were:
• Portability of the resulting product
• HTTP protocol support
• An available HTML parser
• A large set of suitable data structures, mainly trees
On the other hand, Java brings a drawback in the form of slightly worse performance
of the running application. However, this point is not critical for the prototype.
Each module is implemented as a standalone application. Each of the modules
except the interface module can run either separately as an application with a graph-
ical user interface or together with the other modules. We will now briefly describe the
implementation of each module.
6.3.1 Interface Module

This module provides the interface between the system and the Internet. It works as
an HTTP proxy; the communication with the remaining modules is implemented using
the HTTP protocol [24]. Each module sends a request for a document with a certain
URI. The interface module downloads the document and stores it locally for later use.
Then it generates an HTTP reply containing the code of the document. The stored
documents are valid for a single information extraction task only. For improving the
reliability of the downloading process, any publicly available general HTTP cache,
such as Squid1, can be used.
6.3.2 Logical Document Module

When provided with a URI, this module analyzes the structure of links leading from this
document and discovers the boundaries of the logical document. As the result, it produces
the list of the URIs of the physical documents.

Figure 6.4: Logical document module

6.3.3 Analysis Module

This module implements the method of HTML code parsing proposed in section
5.4, the visual information modeling in documents that has been proposed in section 5.2
and the transformation of this model into the logical structure model as proposed in section 5.5.
The resulting model of the logical structure is represented using an XML document as
described in section 6.2.2.
6.3.4 Extraction Module

The extraction module implements the tree matching algorithm defined in section 5.6.
Given the logical structure of the document obtained from the analysis module, it
1 http://www.squid-cache.org
Figure 6.5: Analysis module

attempts to locate the subtrees of the logical structure that correspond to the extraction
task specification.
The task specification corresponds to a template of a subtree as shown in the ex-
ample in figure 5.12 (page 52). Again, we use XML for defining the information
extraction task. The XML specification has the following format:
<task name="...">
<input>
<!-- Element format specifications -->
</input>
<model name="...">
<!-- data and label hierarchy spec. -->
</model>
</task>
The whole specification is enclosed in the <task> tag and it is assigned a unique
name. The specification consists of two parts. In the <input> part, the formats of the text
elements are specified. Each format specification consists of a name that identifies the
format and a regular expression that specifies the format itself. Additionally, it can
be specified whether the regular expression is compared in a case-sensitive or case-insensitive
manner. For example, the format for a name of a department can be specified as
<spec case="lower" name="dept">^[a-z0-9\ ]+$</spec>
The <model> part of the task specification defines the template tree. In this defi-
nition, two tags can be used: the <label> tag corresponds to a label in the template and
the <data> tag corresponds to a text field containing a data value. These tags can be
arbitrarily nested in order to create the corresponding hierarchy. Both these tags spec-
ify the expected format of a logical structure tree node by using one or more format
specifications from the <input> section; for example
<label spec="poslabel|joblabel"/>
means that the node matches a node in the logical document structure that has either
the "poslabel" or the "joblabel" format that has been previously specified using a regular
expression in the <input> section. The difference between <label> and <data> is
that a label is just matched to a structure tree node whereas the data forms part of
the extracted output data. Therefore, a data node must have a name attribute set, e.g.
<data spec="posval" name="position"/>
which corresponds to the name of the field in the output data. An example of a complete
extraction task specification is given in Appendix A.
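As an illustration, the <input> section can be compiled into a table of regular expressions along these lines. The concrete task content and the treatment of case="lower" (approximated here by case-insensitive matching instead of lowercasing the text) are simplified assumptions; the real parser is part of the thesis system's Java implementation.

```python
import re
import xml.etree.ElementTree as ET

# A hypothetical task specification in the format described above;
# the element names follow the text, the concrete formats are made up.
TASK = """<task name="staff">
<input>
<spec case="lower" name="dept">^[a-z0-9 ]+$</spec>
<spec name="name">^[A-Z][A-Za-z,. ]+$</spec>
</input>
<model name="person">
<data spec="name" name="name"/>
</model>
</task>"""

def load_formats(task_xml):
    """Compile the <input> format specifications into regular expressions."""
    formats = {}
    for spec in ET.fromstring(task_xml).find("input"):
        flags = re.IGNORECASE if spec.get("case") == "lower" else 0
        formats[spec.get("name")] = re.compile(spec.text, flags)
    return formats
```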
The module provides a graphical user interface (see figure 6.6) that allows the user to
browse the information extraction task and the tree of the logical structure and to
browse the paths in the trees that correspond to each other.

Figure 6.6: Extraction module

6.3.5 Control Panel

The control panel is a simple application that provides a unified user interface for all the
modules. It allows the user to specify the target URL and the extraction task specification
file. Then it subsequently calls the individual modules and finally displays the
information extraction results.
Figure 6.7: Control panel

6.4 Information Extraction Output

The output of the information extraction task is a sequence of data records where each
data record consists of the values of named data fields that have been defined in the
extraction task specification. In our system, two output formats are supported: an
XML file and an SQL script for storing the data in a relational database.
6.4.1 Extracted Data as an XML Document

The format of the resulting XML file is derived from the extraction task specification:
• The sequence of all the output records is enclosed in a root tag whose name
corresponds to the specified extraction task name.
• Each output record is enclosed in a tag whose name corresponds to the name of
the extraction model.
• Each record consists of data fields that are enclosed in tags whose names corre-
spond to the specified data field names in the model section of the task specifi-
cation.
An output document for the example task shown in Appendix A would have the fol-
lowing structure:
<?xml version="1.0" encoding="iso-8859-2"?>
<staff>
<person>
<name>...</name>
<department>...</department>
<email>...</email>
</person>
<person>
...
</person>
...
</staff>

6.4.2 Extracted Data as an SQL Script

The resulting SQL script consists of a sequence of INSERT commands that store the
extracted records into a database table. The commands have the following format:
• The name of the target table corresponds to the name of the extraction
model specified using the <model> tag in the extraction task specification.
• The names of the table columns correspond to the specified data field names in
the model section of the task specification.
An output script for the example task in Appendix A would have the following format:
INSERT INTO person (name, department, email) VALUES (..., ..., ...);
INSERT INTO person ...
...
We assume that the corresponding database table has been created in advance by
the user. In order to create the appropriate SQL commands automatically, it would be
necessary to extend the extraction task specification with the possibility of specifying the
data types of the individual fields.
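The generation of such a script is straightforward; the following illustrative sketch uses naive quoting and hypothetical names, and the real module may differ in detail:

```python
def records_to_sql(table, fields, records):
    """Render extracted records as INSERT commands in the format of
    section 6.4.2. Values are quoted naively (doubling single quotes)."""
    lines = []
    for rec in records:
        vals = ", ".join(
            "'%s'" % str(rec[f]).replace("'", "''") for f in fields)
        lines.append("INSERT INTO %s (%s) VALUES (%s);"
                     % (table, ", ".join(fields), vals))
    return "\n".join(lines)
```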
Chapter 7

Method Evaluation
In order to evaluate the functionality of the proposed method, we have first tested the
proposed information extraction method on selected sets of physical HTML documents
available on the World Wide Web. As the next step, we have run the tests on larger
logical documents on the same web sites.
7.1 Experiments on Physical Documents
We have tested the method on two different cases. The first experiment was extracting
personal information from various university staff member pages. These pages
are usually very strictly formatted and regular. The second experiment deals with
stock quote servers. In this case the documents are significantly more complex, contain
various kinds of data and their structure exhibits various irregularities.
7.1.1 Experiment 1 – Personal Information

We have tested the proposed method on a simple task that consists of processing the
directory pages of various universities and extracting the name, e-mail and department
of the persons listed on these pages. We have defined two extraction models that differ
in the expected logical structure of the information presented in the page. The two
possibilities are shown in figure 7.1.
Figure 7.1: Two variants of the expected logical structure. In the first variant, the
Name node is the parent of the Department and E-mail nodes; in the second variant,
a common root (*) has the Name, Department and E-mail nodes as siblings.

The first variant corresponds to the situation where the name of the person is used
as a page title or a heading. The second variant corresponds to the situation where
# URL Precision % Recall % Variant
1 fit.vutbr.cz 91 100 a
2 mff.cuni.cz 100 100 a
3 is.muni.cz 100 52 a
4 mit.edu - - a
5 cornell.edu 100 100 a
6 yale.edu 100 100 a
7 stanford.edu 100 100 b
8 harvard.edu 100 100 b
9 usc.edu - - b
10 psu.edu 100 100 b
Table 7.1: Sample extraction task results

the name is presented on the same level as the remaining data fields. By adding the
expected field formats and the expected labels as discussed in section 5.6 we obtain
a template tree for each variant (we will call these variants A and B) as shown in
figure 7.2. The XML extraction task specification of both variants is included in
Appendix A. For simplicity, all the regular expressions except the name value have
the case="lower" attribute set in the specification so that the text is always converted
to lowercase before being matched with the regular expression. The format of the name is
matched in a case-sensitive manner since the name should start with a capital letter.
Figure 7.2: Template trees (variants A and B). In variant A, the root node matches
the name format ^[A-Z][A-Za-z,\.\ ]+$ and has two label children matching department
and e-?mail; their data children match the dept format ^[a-z\ \.]+$ and the e-mail
format ^[a-z0-9_\.]+@[a-z0-9_\.]+$, respectively. In variant B, the root matches .*
and has three label children matching name, department and e-?mail, whose data
children match the name, dept and e-mail formats above.

As the data source we have used sets of staff personal pages from various universities.
Table 7.1 shows the values of precision (3.1) and recall (3.2) as defined in section 3.1.3.
From each listed site, we have taken 30 random personal pages. The only input for the
information extraction is the URI of the document and the appropriate tree template
A or B.
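For reference, the precision and recall values reported here follow the usual information extraction definitions (referenced from section 3.1.3); a generic sketch of the computation, not the thesis code:

```python
def precision_recall(extracted, correct_total):
    """`extracted` is one boolean per extracted record (correct or not);
    `correct_total` is the number of records that should have been
    extracted. Returns (precision, recall)."""
    correct = sum(1 for ok in extracted if ok)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / correct_total if correct_total else 0.0
    return precision, recall
```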
There are three basic reasons that may cause the extraction process to fail:
• The extracted data is not labeled as expected; e.g. in the testing set 3 the e-mail
addresses are not denoted by any label.
• The data to be extracted is contained inside larger text elements; e.g. the
appropriate data at mit.edu (4) is presented as unformatted text only, and at
usc.edu the labels are not distinguished from the values.
• Various anti-spam techniques are used in the documents, such as writing e-mail ad-
dresses in some non-standard form or presenting them as an image.
During the tests, the parameters have been set to QSKIP = 0 and MSKIP = 1.
Increasing QSKIP and MSKIP can improve the recall in case the format of the
data fields is specific enough; e.g. the e-mail address can be discovered by its format only,
so that it is not necessary to require any label. Generally, however, increasing these values
causes a significant loss of precision.
7.1.2 Experiment 2 – Stock Quotes

As a second experiment we have extracted some quote data from publicly available
quote servers. The template tree with the regular expressions is given in figure 7.3.
The task consisted of extracting the last price, the change of the last price, the opening
price, the last bid and the traded volume. Examples of some of the documents are
shown in figure 7.4.
Figure 7.3: Template tree for extracting the quote data. The root node matches .*
and has five label children matching ^last, change, open, bid and vol; the corresponding
data children match [0-9\.]+ (with an N/A alternative) and, for the volume,
[1-9][0-9,]+.

The results are summarized in table 7.2. The testing sets 2 and 6 both use tables
for presenting the data in a quite complicated way, so that the logical structure of the
table is not detected properly. It would be necessary to further improve the algorithms
for the table analysis in order to obtain proper results. The lower recall for the
testing sets 3, 4 and 5 has the usual reason – the data is not labeled as expected by the
extraction task specification. The solution is to create more variants of the expected
logical structure as discussed in section 5.6.1.
7.2 Independence of the Physical Realization
In figure 7.5 we can see a comparison of the web directories of two different univer-
sities that correspond to the yale.edu and cornell.edu testing sets. The data from
both these sites has been correctly extracted using a single template tree definition
although each of them uses a distinct way of data presentation and the HTML codes
Figure 7.4: The designs of two different quote servers (the finance.yahoo.com and
quote.com quote servers)
# URL Precision % Recall %
1 finance.yahoo.com 100 100
2 dbc.com 0 0
3 quote.com 100 63
4 pcquote.com 100 59
5 money.cnn.com 100 44
6 uk.finance.yahoo.com 0 0
Table 7.2: Sample extraction task results

of both documents are significantly different. Thus, the proposed extraction method is
to a great extent independent of the physical realization of the HTML documents.
However, the method still depends on the way of the data presentation, i.e. on
the logical structure of the documents. Considering the experiments described in the
previous section, this difference causes the necessity of two variants of the template
tree in the "personal data" experiment and, when only a single template tree is used,
it causes the lower recall of the results, as shown by the results of the second, "stock quote"
experiment. The difference between the logical structures of the documents is apparent
from a visual comparison of the samples. Let's compare the examples of the variant A
of the first experiment that are shown in figure 7.5 with the example of the variant
B that is in figure 7.6. We can notice that in the former case, the name of the person
is presented as being superior to the remaining data (it is used as a title) whereas in
the latter case, the name of the person is listed in the same way as the remaining data.
The results show that in the proposed method, the independence of the HTML code
is partially replaced by the dependence on the logical document structure. However,
compared with the number of various wrappers that would be necessary for extracting
the same data from the listed web sites (typically, one for each site), our method brings
a significant improvement.
7.3 Information Extraction from Logical Documents
The proposed method for extracting data from logical documents has been developed
for processing logical documents that consist of static web pages. However, in the
real World Wide Web, it is not easy to find static logical documents containing large
amounts of data. In both cases – the university staff directories and the quote servers –
it is impossible to present all the data in a static document due to its great amount.
Instead, the data is stored in a database and the documents are generated dynamically
upon a user query. For example, the staff directories typically require entering at least
the last name of a person and return a document containing the data of all the staff
members with that last name. Therefore, the entire directory content cannot be accessed
directly through the web interface. At this point, we face the problem of the hidden
web as mentioned in section 2.3.
For the mentioned reasons, we have used the following method for obtaining the logical
Figure 7.5: Different designs of a university directory (variant A): the Yale University
and Cornell University directories

Figure 7.6: An example of the variant B – Stanford University Directory

documents from the staff directories in order to evaluate the proposed method.
We have chosen some frequent last names, namely "Novak", "Dvorak", "Smith"
and "Johnson". With a standard web browser we used these names for querying the
web directories of the above listed universities. In all the tested directories, the query
results in a dynamic document that contains the list of matching staff members, and
the searched name is contained in the URI of this document. For example, the list
of all the staff members of Stanford University with a name containing "Novak" is
available in a document with the following URI:
https://stanfordwho.stanford.edu/lookup?search=Novak&submit=Search
This example shows a URI of a dynamically generated document with two arguments:
search and submit. From this point, no user queries are necessary and we can regard
this document as the main page of a logical document that contains the data of all
the listed persons that match the query. Note that this procedure is not required by
the proposed information extraction method as such; it is only a way of obtaining a
sufficient amount of logical documents for the method evaluation.
Once we have determined the URI of the main page of the logical document, we can
run the information extraction task on this URI. Since this logical document contains
the personal pages of all the matching persons and the logical structures of the indi-
vidual personal pages will appear as subtrees in the logical structure tree of the whole
logical document, the results of information extraction should be the same as if the in-
dividual personal pages of the listed persons were processed individually. This fact has
been confirmed by the practical tests – the information extraction results correspond
to the results for the individual physical documents that have been processed during
the tests described in section 7.1.1 and that are summarized in table 7.1. The used
logical document discovery algorithm (section 5.7), however, tends to include more ad-
ditional pages in the logical document that do not influence the information extraction
result, but the download and processing of these excessive pages is time-consuming.
For example, in the case of Harvard University, 62 documents have been downloaded
during the logical document discovery whereas the data to be extracted was contained
in 4 of them.
In the case of the quote servers, no summary pages are available that could be used as
the main pages of some logical documents. For obtaining the data, the quote symbols
must be entered exactly and therefore the generated documents must be analyzed
separately.
Chapter 8

Conclusions
The proposed method of information extraction is usable for real data-intensive documents
available through the World Wide Web. By data-intensive documents we mean the
documents that are primarily intended for presenting data in a relatively regular and
structured form, where the visual information plays an important role in the readers'
understanding of the document. These documents often contain up-to-date data that
is worth extracting and, typically, the documents are automatically generated from a
back-end database. On the other hand, the method is not suitable for processing the
documents where the desired information is buried in large blocks of unformatted or
poorly formatted text. For such documents, the traditional text document processing
and natural language processing methods are more applicable.
There is one more important issue in information extraction as such that is not
of a technical nature. Quite often, the provider of the information that is presented
on the web prefers browsing of the documents by people rather than by automatic tools,
mainly for marketing reasons (advertisement etc.). Such subjects often use various
techniques that complicate automatic processing of the documents, such as detection
and blocking of the client or "hiding" the information in the document. There are,
however, still many areas where information extraction is justifiable and useful.
8.1 Summary of Contributions
The proposed method presents a novel approach to information extraction from HTML
documents. In contrast to current methods based mainly on the direct analysis of the
HTML code, our method has the following important features:
• Independence of the underlying HTML code of the document. The document is
described by an abstract model, which is then used for extracting information.
This abstraction avoids the dependence on particular HTML tags, which is the
bottleneck of the wrapper approach.
• Resistance to changes of the documents. The use of the abstract model ensures that
the method is resistant to changes in the data presentation in the document unless
the logical structure of the document changes.
• No training phase required. The information extraction process can start as soon
as the extraction task specification is finished. No training set of exam-
ple documents is needed. The method allows processing new, previously unknown
documents that correspond to the extraction task specification.
Aside from this main contribution, there are some issues in the method proposal
that we consider significant or novel contributions:
• Formal models of the visual information in the document. To the author's best
knowledge, this is the first attempt to formally describe the information
that is given to the user by visual means. We propose formal models of
two components of this information – the page layout and the visual attributes
of the text.
• Modeling the logical structure on the basis of the visual model. This approach is
unique in the processing of HTML documents although a similar idea has been pro-
posed for other types of documents. The model of the logical structure presents
important information about the document that can be used (aside from infor-
mation extraction) in many other areas such as information retrieval (searching
documents based on structured queries instead of single keywords) or alternative
document presentation (e.g. structure-aware voice readers for blind people).
• Application of the tree matching algorithms. Although the hierarchical organi-
zation of HTML documents is commonly accepted, the use of tree matching
algorithms for this task can be considered a novel contribution to this area.
Finally, a contribution of the thesis is the experimental information extraction
system that implements the proposed techniques. This system has been implemented
in the Java environment and has been used for verifying the method in the real world.
8.2 Possible Improvements and Future Work
From the point of view of further improvements of the method, the most important
point seems to be the proposed algorithm for analyzing the logical structure of HTML
tables. Although this algorithm works satisfactorily for most documents, there
exist more complex ways of presenting data in a table that are not recognized properly,
as can be seen, for example, from the test results given in Section 7.1.2. To be able to
process a larger set of possible variants, a more sophisticated analysis method should
be developed.
The second issue is the way the logical document structure is used for information
extraction. In our method, we use tree matching algorithms and we avoid any
machine learning phase. However, for some applications, it would be interesting to
use machine learning algorithms for inferring the extraction task specification
automatically.
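To illustrate the kind of tree matching involved, the following is a minimal Java sketch, not the implementation used in the thesis: the Node class, the unordered child matching, and the uniformly case-insensitive comparison are simplifying assumptions made here for brevity.

```java
import java.util.*;
import java.util.regex.*;

// Simplified matching of a template tree of regular expressions
// against a logical structure tree (an unordered tree-inclusion test).
public class TreeMatch {
    static class Node {
        final String text;
        final List<Node> children = new ArrayList<>();
        Node(String text, Node... ch) {
            this.text = text;
            children.addAll(Arrays.asList(ch));
        }
    }

    // A template node matches a structure node if its regex matches the
    // node's text and every template child matches some child of the node.
    static boolean matches(Node tmpl, Node node) {
        if (!Pattern.compile(tmpl.text, Pattern.CASE_INSENSITIVE)
                    .matcher(node.text).find())
            return false;
        for (Node tc : tmpl.children) {
            boolean found = false;
            for (Node nc : node.children)
                if (matches(tc, nc)) { found = true; break; }
            if (!found) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Structure tree: a person record with a labelled e-mail field
        Node person = new Node("John Smith",
                new Node("E-mail", new Node("smith@example.com")));
        // Template tree in the spirit of the task specification in Appendix A
        Node template = new Node("^[A-Z][A-Za-z,\\. ]+$",
                new Node("e-?mail",
                        new Node("[a-z0-9_\\.]+@[a-z0-9_\\.]+")));
        System.out.println(matches(template, person)); // prints "true"
    }
}
```

The real system additionally ranks candidate matches by weight and supports disjunctions of field specifications; this sketch only shows the structural principle.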
Bibliography
[1] Adelberg, B. NoDoSe – A Tool for Semi-Automatically Extracting Structured
and Semistructured Data from Text Documents. In Proceedings of the 1998 ACM
SIGMOD International Conference on Management of Data. Seattle, Washington,
United States, 1998
[2] Anjewierden, A. AIDAS: Incremental Logical Structure Discovery in PDF Doc-
uments. In 6th International Conference on Document Analysis and Recognition
(ICDAR). Seattle, USA, 2001
[3] Ashish, N., Knoblock, C. Wrapper Generation for Semi-structured Internet
Sources. In Workshop on Management of Semistructured Data. Tucson, Arizona,
1997
[4] Atzeni, P., Mecca, G., Merialdo, P. Semistructured and Structured Data in the
Web: Going Back and Forth. In Proceedings of ACM SIGMOD Workshop on
Management of Semi-structured Data. 1997
[5] Baumgartner, R., Flesca, S., Gottlob, G. Visual Web Information Extraction with
Lixto. In Proceedings of the 27th International Conference on Very Large Data
Bases. Roma, Italy, 2001
[6] Berners-Lee, T. The Semantic Web. Scientific American. May 2001
[7] Bos, B., Lie, H.W., Lilley, C., Jacobs, I. (editors). Cascading Style
Sheets, level 2, CSS2 Specification. W3C Recommendation 12 May 1998.
http://www.w3.org/TR/1998/REC-CSS2-19980512
[8] Burget, R. Analyzing Logical Structure of a Web Site. In Proceedings of 5th In-
ternational Conference ISM '02 – Information Systems Modelling. Ostrava, CZ,
MARQ, 2002, p. 29-35, ISBN 80-85988-70-4
[9] Burget, R. HTML Document Analysis for Information Extraction. In Proceedings
of 8th EEICT Conference. Brno, CZ, FIT VUT, 2002, p. 426-430, ISBN 80-214-
2116-9
[10] Burget, R. Information Extraction from WWW Based on the Data Structure
Knowledge (in Czech). In Proceedings of the 2nd Conference Znalosti 2003.
Ostrava, CZ, FEI VSB, 2003, p. 271-280, ISBN 80-248-0229-5
[11] Burget, R. Hierarchies in HTML Documents: Linking Text to Concepts. Accepted
for 3rd International Workshop on Web Semantics – WebS '04. Zaragoza, Spain,
2004
[12] Buttler, D., Liu, L., Pu, C. A Fully Automated Object Extraction System for
the World Wide Web. In Proc. of IEEE International Conference on Distributed
Computing Systems. 2001
[13] Carchiolo, V., Longheu, A., Malgeri, M. Extracting Logical Schema from the Web.
In PRICAI Workshop on Text and Web Mining. Melbourne, Australia, 2000
[14] Chung, C.Y., Gertz, M., Sundaresan, N. Reverse Engineering for Web Data: From
Visual to Semantic Structures. In 18th International Conference on Data Engi-
neering (ICDE 2002). IEEE Computer Society, 2002
[15] Ciravegna, F. (LP)2, an Adaptive Algorithm for Information Extraction from Web-
related Texts. In Proceedings of the IJCAI-2001 Workshop on Adaptive Text Ex-
traction and Mining. Seattle, USA, 2001
[16] Clark, P., Niblett, T. The CN2 Induction Algorithm. Machine Learning, 3:261-283,
1989
[17] Cohen, W.W., Hurst, M., Jensen, L.S. A Flexible Learning System for Wrapping
Tables and Lists in HTML Documents. In Proceedings of the Eleventh International
World Wide Web Conference. Honolulu, Hawaii, USA, 2002
[18] Crescenzi, V., Mecca, G., Merialdo, P. RoadRunner: Towards Automatic Data Ex-
traction from Large Web Sites. Technical Report n. RT-DIA-64-2001, D.I.A. Uni-
versita di Roma Tre, 2001
[19] Crescenzi, V., Mecca, G., Merialdo, P. Automatic Web Information Extraction in
the RoadRunner System. In International Workshop on Data Semantics in Web
Information Systems. Yokohama, Japan, 2001
[20] Deogun, J.S., Sever, H., Raghavan, V.V. Structural Abstractions of Hypertext
Documents for Web-based Retrieval. In Proceedings of DEXA 98 – 9th International
Conference on Database and Expert Systems Applications. Vienna, Austria, 1998
[21] DiPasquo, D. Using HTML Formatting to Aid in Natural Language Processing on
the World Wide Web. School of Computer Science, Carnegie Mellon University,
Pittsburgh, 1998
[22] Embley, D.W., Campbell, D.M., Jiang, Y.S., Ng, Y.-K., Smith, R.D., Liddle, S.W.,
Quass, D.W. A Conceptual-Modeling Approach to Extracting Data from the Web.
In Proc. of the 17th International Conference on Conceptual Modeling (ER'98).
Singapore, 1998
[23] Embley, D.W., Jiang, Y.S., Ng, Y.-K. Record-Boundary Discovery in Web Docu-
ments. In Proc. of the 1999 ACM SIGMOD International Conference on Manage-
ment of Data. 1999
[24] Fielding, R., et al. Hypertext Transfer Protocol – HTTP/1.1. RFC 2616, The
Internet Society, 1999. http://rfc.net/rfc2616.html
[25] Freitag, D. Using Grammatical Inference to Improve Precision in Information Ex-
traction. In ICML-97 Workshop on Automata Induction, Grammatical Inference,
and Language Acquisition. 1997
[26] Freitag, D. Information Extraction from HTML: Application of a General Learning
Approach. In Proc. of the Fifteenth Conference on Artificial Intelligence AAAI-98.
1998
[27] Freitag, D., McCallum, A. Information Extraction with HMMs and Shrinkage.
In Proceedings of the AAAI-99 Workshop on Machine Learning for Information
Extraction. 1999
[28] Fuhr, N., Grossjohann, K. XIRQL: An XML Query Language Based on Informa-
tion Retrieval Concepts. In Proc. of the 24th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval. New Orleans,
USA, 2001
[29] Gibson, D., Kleinberg, J., Raghavan, P. Inferring Web Communities from Link
Topology. In Proc. of 6th Intl. Conference on Database Systems for Advanced Ap-
plications (DASFAA'99). IEEE Computer Society, 1999
[30] Gold, E.M. Language Identification in the Limit. Information and Control,
10(5):447-474, 1967
[31] Grishman, R., Sundheim, B. Message Understanding Conference – 6: A Brief
History. In Proceedings of the 16th International Conference on Computational
Linguistics. Copenhagen, Denmark, 1996
[32] Gu, X.-D., Chen, J., Ma, W.-Y., Chen, G.-L. Visual Based Content Understanding
towards Web Adaptation. In Proc. Adaptive Hypermedia and Adaptive Web-Based
Systems. Malaga, Spain, 2002, pp. 164-173
[33] Guan, T., Wong, K.F. KPS – a Web Information Mining Algorithm. In The 8th
International World Wide Web Conference. Toronto, Canada, 1999
[34] Guttner, J. Object Database on Top of the Semantic Web. In Proceedings of the
WI/IAT 2003 Workshop on Applications, Products and Services of Web-based Sup-
port Systems. Halifax, CA, 2003, pp. 97-102
[35] Hong, T.W., Clark, K.L. Using Grammatical Inference to Automate Information
Extraction from the Web. In Principles of Data Mining and Knowledge Discovery.
2001
[36] Kan, M.-Y. Combining Visual Layout and Lexical Cohesion Features for Text Seg-
mentation. Columbia University Computer Science Technical Report, CUCS-002-01.
2001
[37] Kosala, R., Van den Bussche, J., Bruynooghe, M., Blockeel, H. Information Ex-
traction in Structured Documents Using Tree Automata Induction. In Principles
of Data Mining and Knowledge Discovery, Proceedings of the 6th International
Conference (PKDD-2002). 2002
[38] Kushmerick, N., Weld, D.S., Doorenbos, R.B. Wrapper Induction for Information
Extraction. In International Joint Conference on Artificial Intelligence. 1997
[39] Kushmerick, N. Wrapper Induction: Efficiency and Expressiveness. Artificial In-
telligence, vol. 118, no. 1-2, pp. 15-68, 2000
[40] Kushmerick, N. Wrapper Verification. World Wide Web Journal, vol. 3, no. 2, pp.
79-94, 2000
[41] Kushmerick, N., Thomas, B. Adaptive Information Extraction: Core Technolo-
gies for Information Agents. In Intelligent Information Agents R&D in Europe: An
AgentLink Perspective. Lecture Notes in Computer Science 2586, Springer, 2002
[42] Lin, S.-H., Ho, J.-M. Discovering Informative Content Blocks from Web Doc-
uments. In The Eighth ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (SIGKDD'02). 2002
[43] Liu, B., Grossman, R., Zhai, Y. Mining Data Records in Web Pages. In Proceedings
of the ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining (KDD-2003). Washington, DC, USA, 2003
[44] Manola, F., Miller, E. (editors). RDF Primer. W3C Working Draft 11 November
2002. http://www.w3.org/TR/2002/WD-rdf-primer-20021111/
[45] May, W. Modeling and Querying Structure and Contents of the Web. In Proceed-
ings of the 10th International Workshop on Database and Expert Systems Appli-
cations. Florence, Italy, 1999
[46] Mello, R., Heuser, C.A. A Bottom-Up Approach for Integration of XML Sources.
In International Workshop on Information Integration on the Web. Rio de Janeiro,
Brasil, 2001
[47] Morkes, J., Nielsen, J. Concise, SCANNABLE, and Objective: How to Write for
the Web. 1997. http://www.useit.com/papers/webwriting/writing.html
[48] Mukherjee, S., Yang, G., Tan, W., Ramakrishnan, I.V. Automatic Discovery of Se-
mantic Structures in HTML Documents. In International Conference on Document
Analysis and Recognition (ICDAR). 2003
[49] Muslea, I., Minton, S., Knoblock, C.A. Hierarchical Wrapper Induction for
Semistructured Information Sources. Journal of Autonomous Agents and Multi-
Agent Systems, 4:93-114, 2001
[50] Nahm, U.Y., Mooney, R.J. Text Mining with Information Extraction. In Proceed-
ings of the AAAI 2002 Spring Symposium on Mining Answers from Texts and
Knowledge Bases. 2002
[51] Nelson, T.H. Complex Information Processing: A File Structure for the Complex,
the Changing and the Indeterminate. In ACM/CSC-ER, Proceedings of the 1965
20th National Conference. Cleveland, USA, 1965
[52] Niu, C., Li, W., Ding, J., Srihari, R. A Bootstrapping Approach to Named Entity
Classification Using Successive Learners. In Proceedings of the ACL-2003 Confer-
ence. Sapporo, Japan, 2003
[53] Quinlan, J.R., Cameron-Jones, R.M. FOIL: A Midterm Report. Machine Learning:
ECML-93, Vienna, Austria, 1993
[54] Raggett, D., Le Hors, A., Jacobs, I. (editors). HTML
4.01 Specification. W3C Recommendation 24 December 1999.
http://www.w3.org/TR/1999/REC-html401-19991224
[55] Salton, G. Recent Studies in Automatic Text Analysis and Document Retrieval.
JACM, 20(2):258-278, Apr. 1973
[56] Shasha, D., Wang, J.T.L., Shan, H., Zhang, K. ATreeGrep: Approximate Searching
in Unordered Trees. In 14th International Conference on Scientific and Statistical
Database Management. Edinburgh, Scotland, 2002
[57] Soderland, S. Learning to Extract Text-based Information from the World Wide
Web. In Proceedings of Third International Conference on Knowledge Discovery
and Data Mining (KDD-97). 1997
[58] Summers, K. Toward a Taxonomy of Logical Document Structures. In Electronic
Publishing and the Information Superhighway: Proceedings of the Dartmouth Insti-
tute for Advanced Graduate Studies (DAGS '95). Boston, USA, 1995, pp. 124-133
[59] Summers, K. Automatic Discovery of Logical Document Structure. PhD thesis.
Cornell Computer Science Department Technical Report TR98-1698, 1998
[60] Tajima, K., Tanaka, K. New Techniques for the Discovery of Logical Documents
in Web. In International Symposium on Database Applications in Non-Traditional
Environments (DANTE'99). 1999
[61] World Wide Web Consortium (W3C) pages. http://www.w3.org/
[62] Yang, Y., Zhang, H. HTML Page Analysis Based on Visual Cues. In Proc. of 6th
International Conference on Document Analysis and Recognition. Seattle, USA, 2001
[63] Zizi, M., Lafon, M. Hypermedia Exploration with Interactive Dynamic Maps. In-
ternational Journal on Human Computer Interaction Studies. 1995
Appendix A

Example Task Specification

This example shows the XML specification of the two information extraction tasks
described in Section 7.1.
Experiment 1 - Personal Information
Variant A
<?xml version="1.0" encoding="iso-8859-2"?>

<task name="staff">
  <!-- Input field spec and path specification -->
  <input>
    <spec case="lower" name="namelabel">name</spec>
    <spec case="sensitive"
          name="nameval">^[A-Z][A-Za-z,\.\ ]+$</spec>
    <spec case="lower" name="deplabel">department</spec>
    <spec case="lower" name="dptlabel">dept</spec>
    <spec case="lower" name="depval">^[a-z0-9\ $]+</spec>
    <spec case="lower" name="maillabel">e-?mail</spec>
    <spec case="lower"
          name="mailval">^[a-z0-9_\.]+@[a-z0-9_\.]+</spec>
  </input>

  <!-- Inspection model (field hierarchy) -->
  <model name="person">
    <data spec="nameval" name="name">
      <label spec="deplabel|dptlabel">
        <data spec="depval" name="department"/>
      </label>
      <label spec="maillabel">
        <data spec="mailval" name="email"/>
      </label>
    </data>
  </model>
</task>

Variant B

<?xml version="1.0" encoding="iso-8859-2"?>

<task name="staff">
  <!-- Input field spec and path specification -->
  <input>
    <spec case="lower" name="namelabel">name</spec>
    <spec case="sensitive"
          name="nameval">^[A-Z][A-Za-z,\.\ ]+$</spec>
    <spec case="lower" name="deplabel">department</spec>
    <spec case="lower" name="dptlabel">dept</spec>
    <spec case="lower" name="depval">^[a-z0-9\ $]+</spec>
    <spec case="lower" name="maillabel">e-?mail</spec>
    <spec case="lower"
          name="mailval">^[a-z0-9_\.]+@[a-z0-9_\.]+</spec>
  </input>

  <!-- Inspection model (field hierarchy) -->
  <model name="person">
    <label spec="namelabel">
      <data spec="nameval" name="name"/>
    </label>
    <label spec="maillabel">
      <data spec="mailval" name="email"/>
    </label>
    <label spec="deplabel|dptlabel">
      <data spec="depval" name="department"/>
    </label>
  </model>
</task>

Experiment 2 - Stock Quotes


<?xml version="1.0" encoding="iso-8859-2"?>
<task name="quotes">
  <!-- Input field spec and path specification -->
  <input>
    <spec case="lower" name="float">[0-9\.]+</spec>
    <spec case="lower" name="int">[1-9][0-9,]+</spec>
    <spec case="lower" name="na">N/A</spec>
    <spec case="lower" name="lastlabel">^last</spec>
    <spec case="lower" name="chglabel">change</spec>
    <spec case="lower" name="openlabel">^open</spec>
    <spec case="lower" name="bidlabel">bid</spec>
    <spec case="lower" name="vollabel">vol</spec>
  </input>

  <!-- Inspection model (field hierarchy) -->
  <model name="quote">
    <label spec="lastlabel">
      <data spec="float" name="last"/>
    </label>
    <label spec="chglabel">
      <data spec="float" name="change"/>
    </label>
    <label spec="openlabel">
      <data spec="float" name="open"/>
    </label>
    <label spec="bidlabel">
      <data spec="na" name="bid"/>
    </label>
    <label spec="vollabel">
      <data spec="int" name="volume"/>
    </label>
  </model>
</task>
Appendix B

Document Type Definitions

In the following sections we give the formal Document Type Definitions (DTD) for all
the XML document formats used in our experimental information extraction system.
B.1 Task Specification

The following DTD defines the XML document format of the extraction task specification.
An example of such a specification is given in Appendix A. It allows specifying the
formats of all fields by regular expressions in the <input> section and defining the
template tree in the <model> section. This tree consists of the <label> and <data>
nodes. Both these tags specify an expected format of a logical structure tree node by
using one or more format specifications from the <input> section. The difference
between label and data is that a label is just matched to a structure tree node,
whereas data forms part of the extracted output data.
<?xml version="1.0" encoding="UTF-8"?>

<!-- The root element -->
<!ELEMENT task (input, model)>
<!ATTLIST task
    name CDATA #REQUIRED>

<!-- Field format specifications -->
<!ELEMENT input (spec*)>

<!-- Specification of a named field format. The #PCDATA contains
     a regular expression. The comparison may be either case
     sensitive or lower case. -->
<!ELEMENT spec (#PCDATA)>
<!ATTLIST spec
    name CDATA #REQUIRED
    case (lower|sensitive) "lower">

<!-- Template tree specification -->
<!-- A hierarchy of labels (not included in the output) and data
     (included in the output). The spec attribute must correspond
     to a name of any <spec> element above; logical disjunction
     can be denoted by "name1|name2". -->
<!ELEMENT model (label*, data*)>
<!ATTLIST model
    name CDATA #REQUIRED>

<!ELEMENT label (label*, data*)>
<!ATTLIST label
    name CDATA ""
    spec CDATA "">

<!ELEMENT data (label*, data*)>
<!ATTLIST data
    name CDATA #REQUIRED
    spec CDATA "">

<!-- End of the DTD -->
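As an illustration of how a <spec> declaration could be evaluated, the following is a hypothetical helper, not part of the thesis system; in particular, treating case="lower" as a case-insensitive regular expression match is a simplification of the lower-case comparison described above.

```java
import java.util.regex.Pattern;

// Sketch of evaluating a field format specification: the comparison
// is case sensitive only when the spec carries case="sensitive".
public class SpecMatcher {
    public static boolean matches(String regex, String caseMode, String text) {
        int flags = "sensitive".equals(caseMode) ? 0 : Pattern.CASE_INSENSITIVE;
        return Pattern.compile(regex, flags).matcher(text).find();
    }

    public static void main(String[] args) {
        // A label spec from Appendix A matched against a heading cell
        System.out.println(matches("^last", "lower", "Last Trade"));          // true
        // A case-sensitive value spec rejecting a lowercase name
        System.out.println(matches("^[A-Z][A-Za-z,\\. ]+$", "sensitive", "john smith")); // false
    }
}
```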

B.2 Logical Document Representation

This DTD defines the XML document format for representing the set of URIs that form
a logical document. The <logical_document> element represents the whole logical
document with the URI of the main page; the contained <document> elements contain
the URIs of the physical documents. An example of the document is shown in Section
6.2 on page 58.
<?xml version="1.0" encoding="UTF-8"?>

<!ELEMENT logical_document (document+)>
<!ATTLIST logical_document
    url CDATA #REQUIRED>

<!ELEMENT document EMPTY>
<!ATTLIST document
    url CDATA #REQUIRED>

<!-- End of the DTD -->

B.3 Logical Structure Representation

This is the Document Type Definition for the XML representation of the resulting
logical document structure model that is the product of the logical structure analysis.
The format of this file is described in detail in Section 6.2.2 on page 58.
<?xml version="1.0" encoding="UTF-8"?>

<!-- The root element, structure of a logical document -->
<!ELEMENT document (text*)>
<!ATTLIST document
    title CDATA ""
    url CDATA "">

<!-- A node of the logical structure tree -->
<!ELEMENT text (text*)>
<!ATTLIST text
    content CDATA ""
    visual CDATA #REQUIRED
    expr CDATA #REQUIRED
    weight CDATA #REQUIRED>

<!-- End of the DTD -->