in Archeological Reports
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission from the author.
DIR 2007, March 28-29, Leuven, Belgium.
Copyright 2007 Hans Paijmans and Sander Wubben.

1 Rijksdienst voor Archeologie, Cultuurlandschap en Monumenten
2 Continuous Access to Cultural Heritage
3 Reading Images in the Cultural Heritage
4 Mining for Information in Texts from the Cultural Heritage
5 KennisInfrastructuur CultuurHistorie
6 The technical report on Open Boek can be found online as http://www.referentiecollectie.nl/Openboek/ob_rapport.pdf
To start with, the system should be immediately usable from the beginning of the project, and remain open for additions and changes. Therefore we adopted a tight modular approach, where the modules ('experts') communicate through ASCII files, using Unix text tools where possible. This may have had some impact on performance, but it makes it easier to inspect intermediary results. It also invites experimenting with different modules that have similar functionality.

Then there is the format of the original documents. Although some documents in the collection are written in Word, it was found that this format is not amenable to manipulation by third party developers. PDF files are, and as the majority of the documents was available in PDF, we decided to limit ourselves to that format.

As a first step, the PDFs are converted to individual HTML pages and separate images, keeping as much of the original layout as possible. Then the text proper is extracted from the HTML. One typical database, of 750 PDF documents, yielded 50 MB of text, contained in 30,000 pages and as many images. At retrieval time, the HTML pages are displayed, but a link to the original PDF file is present at all times. Of course, it is not necessary to have a PDF file; the documents can be delivered in HTML directly, but most documents in databases like the ones we envisage are in PDF format anyway.

3.1 The corpus
Because we are applying Natural Language Processing (NLP) technology in the field of archaeology, we need data that is relevant to this specific domain. To obtain this data, we selected 75 archaeological reports at random from the site E-Depot Nederlandse Archeologie (footnote 8), dealing with archaeological excavations and historical research. Together, these 75 reports contain 420,369 words and punctuation marks.

3.2 The classes
To automatically make explicit the information embedded in numerics in these archaeological reports, we first need to extract and classify these numerics. Before we can do this, we should define the classes we want to find. The CIDOC/CRM model [5], a well-known standard in the field of cultural heritage, was used as a reference. The classes we selected are shown in Table 2.

The majority of the numerics we found can be fitted into the CIDOC model. The remaining items proved to be numerics that are only relevant inside the individual document they were in, such as image numbers and page numbers; these are found in the classes 'Reference' and 'Other'.
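
As an illustration of the extraction step that precedes classification in Section 3.2, the following Python sketch collects every numeric token together with a small window of surrounding tokens. Such windows are a common input format for a classifier, but the tokenizer, the window width of two tokens and the padding symbol `_` are assumptions made for this example; they are not taken from the Open Boek implementation, and the function names are hypothetical.

```python
import re

# Assumed tokenizer: numbers (optionally with '.' or ',' groups such as 30.000),
# then words, then single punctuation marks.
TOKEN_RE = re.compile(r"\d+(?:[.,]\d+)*|\w+|[^\w\s]")
NUMERIC_RE = re.compile(r"\d+(?:[.,]\d+)*")


def tokenize(text):
    """Split text into number, word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def numeric_instances(text, width=2, pad="_"):
    """Return (left context, numeric, right context) for every numeric token.

    `width` and `pad` are illustrative choices, not values from the paper.
    """
    tokens = tokenize(text)
    padded = [pad] * width + tokens + [pad] * width
    instances = []
    for i, tok in enumerate(tokens):
        if NUMERIC_RE.fullmatch(tok):
            j = i + width  # position of the token in the padded list
            left = padded[j - width:j]
            right = padded[j + 1:j + 1 + width]
            instances.append((left, tok, right))
    return instances


if __name__ == "__main__":
    sentence = "De put is gegraven van 1601 tot 1700."
    for left, num, right in numeric_instances(sentence):
        # One line per numeric: left context, the numeric itself, right context.
        print(" ".join(left + [num] + right))
```

Each resulting line could then be labelled with one of the selected classes and written to an ASCII file, in keeping with the modular, file-based design described earlier.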
represented in a standardized form. Consider the sentences van 1601 tot 1700 and In de 17-

Table 4: Some examples of generated data in the Timespan-tags

4. OPEN BOEK: COMPLETION
We mentioned already that the recognition of numeric values is only the first step in the creation of Open Boek, and

10 It must be kept in mind that the Iron Age in Western Europe is a very different period from the Iron Age in Greece or Egypt.
11 KennisInfrastructuur CultuurHistorie is a sister project of RICH, and specially interested in spatial references.

selected and implemented, to reduce the number of keywords.

6. REFERENCES
[1] D. W. Aha, D. Kibler, and M. K. Albert. Instance-based learning algorithms. Machine Learning, 7:37-66, 1990.
[2] S. Buchholz and A. van den Bosch. Integrating seed names and n-grams for a named entity list and classifier. 2000.
[3] C. Buckley. Looking at limits and tradeoffs: Sabir research at TREC 2005. In Harman and Voorhees [10].
[4] S. Cost and S. Salzberg. A weighted nearest neighbor algorithm for learning with symbolic features. Mach. Learn., 10(1):57-78, 1993.
[5] N. Crofts, M. Doerr, T. Gill, S. Stead, and M. Stiff. Definition of the CIDOC conceptual reference model 4.2. Technical report, CIDOC, 2005.
[6] W. Daelemans, J. Zavrel, K. van der Sloot, and A. van den Bosch. Timbl: Tilburg memory based learner, version 5.1, reference guide. ILK technical report 04-02. Technical report, Tilburg University, 2004.
[7] S. Dipper. Xml-based stand-off representation and exploitation of multi-level linguistic annotation. In <BXML 2005> Berliner XML Tage 2005, 2005.
[8] L. Dybkjaer, editor. Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000), 2000.
[9] D. Harman, editor. The Fourth Text REtrieval Conference (TREC-4). National Institute of Standards and Technology, 1999.
[10] D. Harman and E. Voorhees, editors. The Fourteenth Text REtrieval Conference (TREC-14). National Institute of Standards and Technology, 2005.
[11] J. Holmen, C.-H. Ore, and O. Eide. Documenting two histories at once: digging into archaeology. In CAA 2003 - Computer Applications and Quantitative Methods in Archaeology. Bar International Series 1127 2004, 2003.
[12] J. Paijmans. Recognizing related data in archeological reports, 2007.
[13] J. J. Paijmans. Tutorial for the smart information retrieval system. http://pi0959.kub.nl/Paai/Onderw/Smart/hands.html, August 1996.
[14] J. J. Paijmans. Explorations in the Document Vector Model of Information Retrieval. PhD thesis, Katholieke Universiteit Brabant, 1999.
[15] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill New York [etc.] - 448 pp., 1983.