
An XML-Centric Storage for Better Preservation and Maintenance of Data:
Union Catalog of NDAP, Taiwan

Tzu-Yen Hsu, Ting-Hua Chen, Chung-Hsi Hung, and Sea-Hom Chou

Institute of Information Science, Academia Sinica, Taipei 115, Taiwan


{ciyan, james, johan, seahom}@iis.sinica.edu.tw

Abstract. The Union Catalog (UC) of Taiwan was established to provide
an integrated search service for millions of digital objects distributed
in the databases of different institutions. The main challenge is how
to continuously and consistently manage large quantities of data. XML
technologies have already been recommended over database systems for
long-term data preservation. In addition, we assume that a
database design in our case would be complex and that consistent main-
tenance would be difficult. For this reason, databases are not used as
the primary storage mechanism of the UC. Although the UC adopts an
XML-centric architecture, it has difficulty handling data queries, data
modification, and category listing efficiently. In this paper, we discuss
how we use XML technologies to implement the UC system, and how we
solve the issues arising from XML’s limitations.

Key words: NDAP, architecture, digital library

1 Description
The National Digital Archives Program1 (NDAP) Taiwan, which was launched
in 2002, has digitised over two million cultural artifacts and established more
than one hundred archives. However, the digital content is disseminated over
different institutions and is only searchable via their respective websites. Thus,
we developed the Union Catalog2 (UC) system as an integrated portal website
to provide users with a convenient means by which to search and browse all the
digital resources at the same time.
To implement such a national and permanent web portal, we consider meth-
ods of data storage in terms of better preservation and maintenance. Conven-
tionally, digital libraries have a close relationship with database systems. In our
setting, however, some factors motivate us to focus our storage mechanism on
XML technologies, instead of depending on a database system. The first factor
is that our data consists mostly of XML-formatted metadata describing digital
resources. Second, it has already been demonstrated that XML is a more suitable
format for long-term data preservation [1, 3]. Third, if all the metadata were to
be converted into a database, subsequent maintenance would incur high costs,
especially with respect to technical training and data conversion.

1 National Digital Archives Program, Taiwan, http://www.ndap.org.tw/index_en.php
2 Union Catalog of National Digital Archives Program, Taiwan, http://catalog.digitalarchive.tw/

Metadata sources can be imported via OAI-PMH [2], XML files, or Microsoft
Excel files, after which UC will process and normalize the raw metadata before
storing it in separate XML files. Data contained in metadata can be divided
into two parts: administrative information, comprising the generation date,
categories, a unique identifier, and a link back to the original database; and
descriptive information, which comprises the 15 Dublin Core elements3. As data is
stored in separate XML files, we do not have to worry about designing a
database schema that anticipates future growth. In addition, the data can be easily
read and edited using basic computer text editors like WordPad or Notepad.
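
As an illustration of the two-part record described above, the following minimal sketch builds one metadata file with Python's standard library. The element names `record`, `administrative`, `descriptive`, `identifier`, `generated`, `category`, and `source`, as well as the sample values, are assumptions made for this example; the paper does not specify the actual UC schema. Only the Dublin Core namespace URI is taken from the Dublin Core specification.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15-element Dublin Core set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_record(urn, generated, categories, source_url, descriptive):
    """Build one metadata record with an administrative and a descriptive part.

    All element names outside the Dublin Core namespace are hypothetical.
    """
    record = ET.Element("record")

    # Administrative part: generation date, categories, identifier, back-link.
    admin = ET.SubElement(record, "administrative")
    ET.SubElement(admin, "identifier").text = urn
    ET.SubElement(admin, "generated").text = generated
    for name in categories:
        ET.SubElement(admin, "category").text = name
    ET.SubElement(admin, "source").text = source_url

    # Descriptive part: Dublin Core elements such as title or creator.
    desc = ET.SubElement(record, "descriptive")
    for element, value in descriptive.items():
        ET.SubElement(desc, f"{{{DC_NS}}}{element}").text = value
    return record


record = build_record(
    urn="urn:example:0001",
    generated="2008-01-15",
    categories=["Maps", "Example Institution"],
    source_url="http://example.org/object/0001",
    descriptive={"title": "Sample object", "creator": "Unknown"},
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Because each record is a small, self-describing text file, it can be inspected or corrected with any text editor, which is the maintenance property the paragraph above relies on.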
Although we believe this is a better data storage solution, it has difficulty
handling data queries, data modification, and category listing efficiently. XML
technologies cannot completely overcome these difficulties, particularly under
the heavy loads we expect, where UC has to serve numerous users and manipulate
large volumes of data at the same time. To solve these problems, we adopted Apache Lucene4
technology as the core search engine of UC and took advantage of the strengths
of a database system. Below we provide detailed descriptions of the
Lucene and database solutions.

– Search engine: Apache Lucene is a free, full-text search engine library.
When metadata is imported, the UC system indexes the full text, except
the administrative part, of the metadata into a Lucene document. Accord-
ingly, the number of Lucene documents should be equivalent to the total of
all metadata sources. For data consistency, we do not make any direct mod-
ification of the indexed metadata, but re-import revised metadata sources.
When users press the search button, UC will search in the Lucene indexed
documents, not in the XML files or the database.
– Efficient category listing and data modification: Database systems
are usually responsible for efficient data insertion, updating, and retrieval.
For this reason, we store category records and frequently modified manage-
ment information in a database. We designed three tables for this purpose:
a table for storing category records; a table for storing the management in-
formation of each metadata file, such as the unique identifier (URN) and
the modification date; and a table that links the preceding two tables.
Through services built on the category table, users can also browse UC by
subject or by participating organization.
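
The three-table design in the second item can be sketched as follows. This is only an illustration under stated assumptions: the paper does not name the database system, the tables, or their columns, so SQLite and every identifier below (`category`, `metadata_file`, `metadata_category`, `kind`, and so on) are hypothetical.

```python
import sqlite3

# Hypothetical schema: one table for categories, one for per-file management
# information, and one linking the two (a many-to-many relationship).
SCHEMA = """
CREATE TABLE category (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    kind TEXT NOT NULL            -- e.g. 'subject' or 'organization'
);
CREATE TABLE metadata_file (
    urn      TEXT PRIMARY KEY,    -- unique identifier of the metadata file
    modified TEXT NOT NULL        -- modification date
);
CREATE TABLE metadata_category (
    urn         TEXT NOT NULL REFERENCES metadata_file(urn),
    category_id INTEGER NOT NULL REFERENCES category(id),
    PRIMARY KEY (urn, category_id)
);
"""


def list_by_category(conn, category_name):
    """Return the URNs of all metadata files filed under one category."""
    rows = conn.execute(
        """
        SELECT m.urn
        FROM metadata_file AS m
        JOIN metadata_category AS mc ON mc.urn = m.urn
        JOIN category AS c ON c.id = mc.category_id
        WHERE c.name = ?
        ORDER BY m.urn
        """,
        (category_name,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO category (id, name, kind) VALUES (1, 'Maps', 'subject')")
conn.execute("INSERT INTO category (id, name, kind) VALUES (2, 'Paintings', 'subject')")
conn.executemany(
    "INSERT INTO metadata_file (urn, modified) VALUES (?, ?)",
    [("urn:example:0001", "2008-01-15"), ("urn:example:0002", "2008-02-01")],
)
conn.executemany(
    "INSERT INTO metadata_category (urn, category_id) VALUES (?, ?)",
    [("urn:example:0001", 1), ("urn:example:0002", 1), ("urn:example:0002", 2)],
)
print(list_by_category(conn, "Maps"))
```

Keeping only categories and management information in the database means the tables stay small and can be rebuilt from the XML files if they are ever lost.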

3 Dublin Core Metadata Element Set, http://dublincore.org/documents/dces/
4 Apache Lucene, http://lucene.apache.org/java/docs/index.html

In compliance with the notion of easy maintenance, UC only has to ensure
the data integrity of the XML-formatted metadata, from which other data, such
as Lucene documents and database tables, can be produced for web services or
further applications. To respond to requests for metadata on the Web, we use XML
transformation technologies (XSLT) to facilitate rapid delivery of the data. This
is similar to fetching a static HTML file, without making burdensome database
queries in entity-relationship tables. The data storage architecture of UC and its
relationship with services are shown in Fig. 1.
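
The delivery step can be pictured with the following sketch, which turns one stored metadata file directly into an HTML fragment without touching a database. UC uses XSLT for this; since Python's standard library has no XSLT processor, the sketch substitutes a plain ElementTree traversal for the stylesheet. The record layout and the function name `render_record` are assumptions made for illustration only.

```python
import html
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core namespace


def render_record(xml_text):
    """Render the Dublin Core elements of one record as an HTML definition list.

    Stand-in for an XSLT stylesheet: the XML file is the only input.
    """
    record = ET.fromstring(xml_text)
    rows = []
    for element in record.iter():
        # Keep only elements in the Dublin Core namespace that carry text.
        if element.tag.startswith(f"{{{DC_NS}}}") and element.text:
            label = element.tag.split("}", 1)[1]
            rows.append(
                f"<dt>{html.escape(label)}</dt><dd>{html.escape(element.text)}</dd>"
            )
    return "<dl>" + "".join(rows) + "</dl>"


# Hypothetical stored record; the layout is assumed, not taken from UC.
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <administrative><identifier>urn:example:0001</identifier></administrative>
  <descriptive>
    <dc:title>Sample &amp; object</dc:title>
    <dc:creator>Unknown</dc:creator>
  </descriptive>
</record>"""

page = render_record(SAMPLE)
print(page)
```

The output depends only on the XML file itself, so a rendered page can be cached or regenerated at any time, much like a static HTML file.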

Fig. 1. Data storage architecture and services of UC.

Acknowledgements

This work was supported by the National Science Council of Taiwan under
Grant: NSC 96-3113-H-001-015.

References

1. Haag, D.: Digital Preservation Testbed White Paper – XML and Digital Preservation. (2002), http://www.digitaleduurzaamheid.nl/bibliotheek/docs/white-paper_xml-en.pdf
2. Lagoze, C., Van de Sompel, H., Nelson, M., Warner, S.: The Open Archives Initiative Protocol for Metadata Harvesting - Version 2.0. Open Archives Initiative (2002), http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm
3. Müller, E., Klosa, U., Hansson, P., Andersson, S., Siira, E.: Using XML for Long-term Preservation: Experiences from the DiVA Project. The Sixth International Symposium on Electronic Theses and Dissertations, the Humboldt-University in Berlin, Germany (2003), http://edoc.hu-berlin.de/conferences/etd2003/hansson-peter/HTML/
