Vous êtes sur la page 1sur 5

Memory Based Learning and the Interpretation of Numbers

in Archeological Reports

Hans Paijmans Sander Wubben


University of Maastricht; Institute for Knowledge University of Tilburg
and Agent Technology (student BDM)
paai@uvt.nl S.Wubben@uvt.nl

ABSTRACT very dierent if we want to re ognize numbers, but the se-


le tion of training instan es and subsequent pro essing is
This paper des ribes the pra ti al appli ation of Memory
rather dierent.
Based Learning algorithms to the retrieval of do uments in 1
As a ase, let us onsider an institution su h as the RACM ,
a spe i domain: ar heologi al papers and reports in the
where a great number of do uments and reports about ar he-
Dut h language. It fo uses on the re ognition of phrases
ologi al ex avations, site surveys and similar do uments are
in texts that ontain numbers, either in arabi or roman
stored ele troni ally (and used). A ess to the information
numerals, as ardinals, or as domain-spe i terms ('Middle
in the reports is either by straightforward keyword sear h
Ages'). The numbers then are interpreted as hronologi al
or by a separate database in whi h relevant attributes of
dates, measures or geographi al lo ations and in orporated
the do uments are entered by human operators. There ex-
in a do ument retrieval system.
ist proje ts to reate a more involved XML markup based
on CIDOC/CRM, e.g, by J. Holmen and his ollaborators,
Keywords [11℄, but here no automati extra tion from instan es in the
text into the tags is envisaged. In general, keyword sear h
Ma hine Learning, Information Retrieval, Named Entity Re-
in the do uments an be automated in a number of ways,
ognition, ar heology
but sear hing for do uments or pages that refer to spe i
semanti ontent is all but impossible if the system is in-
1. INTRODUCTION dexed by keywords only. The generation of a database with
During the last de ades of the twentieth entury mu h attributes by humans is a labourious task, and in pra ti e
resear h has been done in text interpretation and data ex- large ba klogs o ur and/or data will be omitted.
tra tion. This is obviously the result of the fa t that im- So, on the one hand, we have ompli ated ontologies that
mense quantities of (ma hine readable) text and do uments are extremely di ult to satisfy, and on the other super-
have been produ ed over the last twenty years. It was not ial keyword a ess. Our approa h is to start with the key-
unnatural that the methods that promised to extra t infor- word a ess, and add the semanti ontent in stages. For
mation from su h large quantities be ame popular; among this, we are onstru ting an information retrieval system,
them Memory Based Learning, that was able to lassify texts Open Boek, that automati ally extra ts, translates and in-
or parts thereof. dexes su h attributes from the text. However, this is only
Of ourse there is a ertain lag between the development a prelude for a mu h more di ult and ambitious task: the
of s ienti methods, and the appli ation in pra ti e, and as identi ation of phrases and images that refer to individual
far as ommer ial appli ations are on erned, su h appli a- obje ts.
tions are often ' losed sour e'. This means that neither the In this paper we present our progress in the re ognition
exa t method nor the sour e ode of the programs are visible and interpretation of numbers and numeri al data. The sys-
2
to the user or the interested developer. This in turn makes it tem, Open Boek, is a 'use ase' for the CATCH proje t ,
3 4
di ult to apply su h methods to new domains or to rene more in parti ular for RICH and MITCH , but it oers
the performan e. Also, the systems des ribed in the litera- also opportunities for other resear h groups in the proje t,
5
ture as often as not are tailored for English texts. Finally, su h as KICH . The ar heologi al framework is that of of
the energy of main resear h seems to have been entered the National Referen e olle tion(NR ).
on Named Entity re ognition dire ted at the re ognition of
entities like, well, names([2℄). Of ourse the method is not 2. OPEN BOEK: THE CURRENT SYSTEM
6
The design of Open Boek hinges on a few requirements.

1
Rijksdienst voor Ar heologie, Cultuurlands hap en Monu-
Permission to make digital or hard copies of all or part of this work for menten
personal or classroom use is granted without fee provided that copies are 2
Continuous A ess to Cultural Heritage
not made or distributed for profit or commercial advantage and that copies 3
Reading Images in the Cultural Heritage
bear this notice and the full citation on the first page. To copy otherwise, or 4
Mining for Information in Texts from the Cultural Heritage
republish, to post on servers or to redistribute to lists, requires prior specific 5
KennisInfrastru tuur CultuurHistorie
permission from the author. 6
DIR 2007, March 28-29, Leuven, Belgium. The te hni al report on Open Boek an be found online as
Copyright 2007 Hans Paijmans and Sander Wubben. http://www.referentie olle tie.nl/Openboek/ob_rapport.pdf
To start with, the system should be both immediatly usable 3.1 The corpus
from the beginning of the proje t, and remain open for ad- Considering we are applying Natural Language Pro ess-
ditions and hanges. Therefore we adopted a tight modular ing te hnology (NLP) in the eld of ar haeology, we need
approa h, where the modules ('experts') ommuni ated in to have data that is relevant for this spe i domain. To
ASCII les, using Unix text tools where possible. This may obtain this data, we sele ted 75 ar haeologi al reports at
have had some impa t on performan e, but makes it easier 8
random from the site E-Depot Nederlandse Ar heologie ,
to inspe t intermediary results. Also, it invites experiment- dealing with ar haeologi al ex avations and histori al re-
ing with dierent modules that have similar fun tionality. sear h. Together, these 75 reports ontain 420,369 words
Then there is the format of the original do uments. Al- and pun tuation marks.
though some do uments in the olle tion are written in Word,
it was found that this format is not amendable to manip- 3.2 The classes
ulation by third party developers. pdf-les are, and as the
To automati ally make expli it the information embedded
majority of the do uments was available in pdf, we de ided
in numeri s in these ar haeologi al reports, we rst need
to limit ourselves to that format.
to extra t and lassify these numeri s. Before we an do
As a rst step, the pdfs are onverted to individual HTML-
this, we should dene the lasses we want to be found. The
pages and separate images, keeping as mu h of the original
CIDOC/CRM model [5℄, a well-known standard in the eld
layout as possible. Then, the text proper was extra ted from
of ultural heritage, was used as a referen e. A table that
the HTML. One typi al database, of 750 pdf do uments,
shows the lasses we sele ted is in 2.
yielded 50 MB text, ontained in 30.000 pages and as many
The majority of the numeri s we found an be tted into
images. At retrieval time, the HTML pages are displayed,
the CIDOC model. The remaining items proved to be nu-
but a link to the original pdf-le is at all times present. Of
meri s only relevant inside the individual do ument they
ourse, it is not ne essary to have a pdf le; the do uments
were in, su h as image numbers and page numbers; these
an be delivered in HTML dire tly, but most do uments in
are found in the lasses 'Referen e' and 'Other'.
databases like we envisage are in pdf format anyway.

SMART 3.3 Extraction


We rst tokenized the text, seperating words from pun -
The extra ted texts were indexed using the venerable SMART
tuation (the pun tuation hara ters are also onsidered to-
7
program, developed between 1965 and 1970 by Salton , that
kens). A feature extra tor was then programmed to extra t
is however still doing very well in the TREC ontests [3℄.
the numeri s and equivalent phrases from the texts, together
SMART is an implementation of the Ve tor Spa e Model[15℄,
with four words of leading and four words of trailing ontext.
whi h essentially retrieves do uments on keyword ombina-
This means every instan e has nine features that an be used
tions and, most importantly, ranks them a ording to some
in the MBL stage. A simple rule-beased NE (Named Entity)
measure of relevan e.
extra tion omponent made sure not only items like 2 and
The SMART program oers several distin t weighting meth-
100 were extra ted, but also the written, the arabi and the
roman ardinals and ordinals like twee, tweede, 2-de or 2e
ods for individual words and we are still debating whi h is
the best for this parti ular purpose. In any ase, it serves as
(two, se ond, 2nd, et etera). A downside of this NE re og-
a fast and reliable indexing and retrieval engine and so an
nizer is that words that begin with a numeri omponent or
be the basis for a very usable do ument retrieval system.
are similar to numeri s need to be spe i ially dened (or
The reation of the keyword indexes for a database of this
at least the mali ious part of it) so they don't get extra ted.
size is typi ally a matter of a few minutes on a modern PC
Dening 'a hter' a morpheme that is not to be extra ted will
running SuSE Linux.
prevent words like a hternaam, a hterbank and a hter from
being extra ted when sear hing for ombinations beginning
Memory Based Learning
with a ht (eight). An english equivalent would be 'ten' be-
Then, we use MBL (Memory Based Learning) te hniques ing a prex in 'tend'; the re ognizer would then extra t 'ten'
to extra t 'interesting' phrases from the texts, e.g. hrono- and 'tenthousand' but not 'tender' or 'tenden ies'.
logi al dates, measures and similar onstru ts (see se tion We tested the feature extra tor on a small dataset on-
3 below). These phrases then are parsed a ording to their sisting of random pages from random ar haeologi al reports.
lass, and the disambigued and normalized information is This set ontained 9,158 tokens. The human annotator
inserted as tags in the HTML-le. New indexes are reated found 506 numbers in this set whereas the feature extra tor
from these tags. The nal result is that the system knows found 518. Of these ndings, 498 were identi al. Assuming
that a parti ular day or year lies within the 'Middle Ages', or the human annotator made no mistakes, this results in a
in the twelfth entury, or in the XII-th entury or whatever pre ision of 0,96 and a re all of 0,98 for the feature extra -
phrase is used in the do ument, and so is able to present tor.
texts, relevant for that parti ular date. The generation of
these indexes and markups is slow; for the database men- 3.4 Manual annotation
tioned here it would take two or three days to re ognize the It is not very hard to extra t numbers from texts, but
phrases and add the markups, but there ertainly is room knowing what these numbers mean is quite something else.
for improvement. Considering the fa t that for example timespans an be writ-
ten in dozens of dierent ways, MBL was our rst hoi e. To
3. OPEN BOEK: MACHINE LEARNING be able to perfom MBL, a set of examples is needed. These
examples an be generated by manually annotating the ex-
7
For more information on SMART and a tutorial see: [14,
8
13℄ http://edna.itor.org/nl/
..lead 2 lead 1 numeri trail 1 trail 2.. • Modied value dieren e metri : a method of looking
boren tot 1 m onder at o-o uren e of values with target lasses to deter-
steentijd ) 8800 - 4900 mine how similar the values of a feature are, using
ook Bijlage III ) . exemplar weights [4℄.
van put 7 zijn geen
werden ook dertien fragmenten kwarts • Information Gain feature weighting: measures the dis-
rimination value , i.e. how mu h ea h isolated feature
ontributes to the orre t lassi ation.
Table 1: Examples of numeri s with leading and
trailing ontent • Five nearest neighbours used for extrapolation: ve
nearest neighbours are used for lassifying obje ts based
on losest training examples in the feature spa e.
tra ted numeri s from the ar haeologi al reports. To do so,
a web-based environment was reated. With the manual • Exponential De ay weighting with parameters a = 1.1
annotations we onstru ted a database of 25,070 instan es, and b = 1.1: gives exponentially less weight to more
ea h onsisting of nine features a ompanied by the lass distant neighbours for lassi ation.
assigned by the human annotator.
These settings were used to perform a tenfold- rossvalidation
test on the remaining data (22,563 instan es). If we were to
ode lass per . F-s ore
lassify every instan e as belonging to the biggest lass, be-
E42 Obje t identier 18.7 % 0.92
ing 'Other', our a ura y would be 40.6 per ent. Therefore,
E47 Spatial oordinates 1.2 % 0.95
we an set the baseline for our system to 40.6 per ent, as
E52 Time-span 16.0 % 0.96
every per entage higher than this number would be an im-
E54 Dimension: depth/height 2.9 % 0.86
provement in the performan e of our system.
E54 Dimension: length/width 1.7 % 0.90
E54 Dimension: diameter 0.8 % 0.85
F-S ore beta = 1 mi roav. 0.93
E54 Dimension: thi kness 0.7 % 0.79
F-S ore beta = 1 ma roav. 0.87
E54 Dimension: surfa e 1.7 % 0.62
AUC, mi roav. 0.96
E54 Dimension: volume 0.2 % 0.62
AUC, ma roav. 0.92
E54 Dimension: weight 0.2 % 0.85
overall a ura y 0.94
E54 Dimension: other 1.5 % 0.88
E60 number 7.6 % 0,90
Referen e 6.2 % 0.93
Table 3: F-S ores, Area under the ROC-Curve and
Other 40.6 % 0.95
overall a ura y obtained with TiMBL

Table 2: Breakdown in lasses with odes, per ent-


ages and F-s ores

3.5 Memory based learning


9
For the MBL we used TiMBL 5.1 [6℄ , a de ision-tree-
based implementation of k-nearest neighbour lassi ation
(KNN). KNN lassi ation is a method of lassifying ob-
je ts based on the losest training examples mapped in the
feature spa e. TiMBL uses indexes in the instan e memory
extensively and therefore an handle dis rete data and large
numbers of various examples well [6℄.
The settings that proved most e ient for a development
set onsisting of 2,507 random instan es were:

• IB1 lassi ation algorithm: a simple instan e-based


learning algorithm. It simply stores all example in- Figure 1: Information gain of the nine features

stan es and nds the losest one. IB1 uses a simi-


larity fun tion where the instan es are des ribed by With a total of 94 % of the instan es lassied orre tly,
n attributes. The algorithm is quite similar to the k- the MBL- omponent performs well above the baseline of
nearest neighbour algorithm. However, IB1 normalizes 40.6 % Implementing it is therefore a eptable. Classes
its atributes' ranges, pro esses instan es in rementally that don't perform very well are, not surprisingly, gener-
and assumes missing values to be maximally dierent ally the smallest lasses as shown in table 2. The system
from the value present. A drawba k is that the storage does not have many examples from these lasses, and thus
of all the items an take up large amounts of mem- does a poorer job at lassifying them. Another thing to take
ory [1℄. into a ount is the fa t that these lasses are all sub lasses
from 'E54 Dimension', and are more akin and therefore more
9
Available from http://ilk.uvt.nl likely to be onfused with ea h other.
As demonstrated in earlier resear h [2℄, gure 1 shows us that the proje t has two very distin t goals. The rst is
that the loser the word is to the fo us, the higher its infor- to build a text retrieval system, that in orporates modern
mation gain gets. This means that words loser to the fo us te hniques for the interpretation of ertain semanti lasses,
(in our ase a numeri ) are more important for the lassi- su h as hronology or geography. To omplete this stage,
 ation of that numeri . When at equal distan e, words we have the following tasks before us:
in front have a slightly higher information gain value than
trailing words. Something that makes our situation inter- • At the moment we are onsidering (but have not yet
esting, is the fa t that the information gain of the numeri implemented) stand-o representation [7℄, where dif-
itself is a tually lower than the information gain of its dire t ferent tag-stru tures for HTML and XML are stored
neighbours. This means that the numeri itself ontributes in separate les. This will result in faster indexing
less to its lassi ation than the words dire tly in front or and will prevent many problems with oni ting tag
in the ba k of it. These results stress the ambiguity of nu- stru tures.
meri s in ar heaologi al texts and the need to use ontext to
disambiguate the numeri s before the numeri al information • The se ond task, s heduled for early 2007 is the de-
therein an be made expli it. velopment of a system to re ognize and disambiguate
spatial referen es, and add the orre t oordinates for
3.6 Making the information explicit 11
intera tive use, in ooperation with KICH . We will
Now that we have a database with lassied examples, use mu h the same approa h as for the numeri data.
we an start to make the numeri information expli it by In fa t, the urrent system is already able to re ognize
implementing a parsing and tagging system. The text needs spatial oordinates and display the orresponding area
to be tagged with the lassi ations done by TiMBL on the using Googlemaps.
numeri s generated by our feature extra tor. This is not the
only task of our system: the tagged information needs to be • Also, an Open Sour e stemmer for Dut h should be

represented in a standardized form. sele ted and implemented, to redu e the number of

Consider the senten es van 1601 tot 1700 and In de 17- keywords.

de eeuw)('From 1601 to 1700' and 'In the 17th entury').


• The hronology and the spatial oordinates are of ob-
Although their stru tures vary, they have the same mean-
vious interest for the ar heologi al ommunity. An in-
ing. In the rst senten e, 1601 and 1700 will both have
ventory must be ompiled of other, similar wishes.
been lassied by our system as 'Timespans', whereas in the
se ond senten e only 17th will have been lassied as being • The system an also be trained on the re ognition of
a Timespan. However, we want our parser to tag both in-
other entities, su h as (of ourse) proper names.
stan es as Timespans spanning from January the rst 1601
to De ember the 31st 1700. • If and when performan e be omes a problem, indexes
This an be a hieved by looking at the ontext. In this and other data should be stored in a SQL-database.
example the word 'eeuw' (meaning entury) in the se ond Currently most index les are plain ASCII, and are
senten e is essential. For this task we programmed a rule- pro essed by the standard Unix text utilities.
based parser using a list of keywords to ombine with the
lassied instan es. It inserts HTML-tags for the lassi a- Still more interesting (and more di ult) is the nal task
tions of the numeri s and the expli it hronologi al data for set before us: to arry the interpretation of NL text to the
timespan-tags in the HTML-le. While doing this, it was a point that we an identify phrases, passages and images that
minor addition to also add histori al periods that have no refer to (ar heologi al) obje ts mentioned in the text. We
number- omponent, su h as the 'Middle Ages' or the 'Iron will des ribe our ideas and approa h to that problem in a
10
Age' with their orre t dates . In the rare o asion where dierent paper[12℄.
'1601' would have been lassied as 'Timespan' and '1700'
as something else and the tagger expe ts a 'Timespan', it
will treat '1700' as a timespan and in lude it in the tag.
5. ACKNOWLEDGEMENTS
This work was supported by NWO/CATCH under grant
1692 01/01/1692 AD - 31/12/1692 AD 640.002.401. No Mi rosoft software was used for the exper-
1674 - 1680 01/01/1674 AD - 31/12/1680 AD iments mentioned in the paper or for the preparation of the
13de en 14de eeuw 01/01/1201 AD - 31/12/1400 AD paper itself.
1 januari 1988 01/01/1988 AD - 01/01/1988 AD

6. REFERENCES
Table 4: Some examples of generated data in the [1℄ D. W. Aha, D. Kibler, and M. K. Albert.
Timespan-tags Instan e-based learning algorithms. Ma hine Learning,
7:3766, 1990.
[2℄ S. Bu hholz and A. van den Bos h. Integrating seed
4. OPEN BOEK: COMPLETION names and n-grams for a named entity list and
lassier. 2000.
We mentioned already that the re ognition of numeri val-
[3℄ C. Bu kley. Looking at limits and tradeos: Sabir
ues is only the rst step in the reation of Open Boek, and
resear h at TREC 2005. In Harman and Voorhees [10℄.
10
It must be kept in mind that the Iron Age in Western
11
Europe is a very dierent period from the Iron Age in Gree e KennisInfrastru tuur CultuurHistorie is a sister proje t of
or Egypt RICH, and spe ially interested in spatial referen es.
[4℄ S. Cost and S. Salzberg. A weighted nearest neighbor
algorithm for learning with symboli features. Ma h.
Learn., 10(1):5778, 1993.
[5℄ N. Crofts, M. Doerr, T. Gill, S. Stead, and M. Sti.
Denition of the ido on eptual referen e model 4.2.
Te hni al report, CIDOC, 2005.
[6℄ W. Daelemans, J. Zavrel, K. van der Sloot, and
A. van den Bos h. Timbl: Tilburg memory based
learner, version 5.1, referen e guide. ilk te hni al
report 04-02. Te hni al report, Tilburg University,
2004.
[7℄ S. Dipper. Xml-based stand-o representation and
exploitation of multi-level linguisti annotation. In
<BXML 2005> Berliner XML Tage 2005, 2005.
[8℄ L. Dybkjaer, editor. Pro eedings of the Se ond
International Conferen e on Language Resour es and
Evaluation (LREC 2000), 2000.
[9℄ D. Harman, editor. The Fourth Text REtrieval
Conferen e (TREC-4). National Institute of
Standards and Te hnology, 1999.
[10℄ D. Harman and E. Voorhees, editors. The Fourteenth
Text REtrieval Conferen e (TREC-14). National
Institute of Standards and Te hnology, 2005.
[11℄ J. Holmen, C.-H. Ore, and O. Eide. Do umenting two
CAA
histories at on e: digging into ar haeology. In
2003 - Computer Appli ations and Quantitative
Methods in Ar haeology. Bar International Series 1127
2004, 2003.
[12℄ J. Paijmans. Re ognizing related data in ar heologi al
reports, 2007.
[13℄ J. J. Paijmans. Tutorial for the smart information
retrieval system.
http://pi0959.kub.nl/Paai/Onderw/Smart/hands.html,
augustus 1996.
[14℄ J. J. Paijmans. Explorations in the Do ument Ve tor
Model of Information Retrieval. PhD thesis,
Katholieke Universiteit Brabant, 1999.
[15℄ G. Salton and M. J. M Gill. Introdu tion to Modern
Information Retrieval. M Graw-Hill New York [et . ℄ -
448 pp., 1983.

Vous aimerez peut-être aussi