1 Description
The National Digital Archives Program1 (NDAP) Taiwan, which was launched
in 2002, has digitised over two million cultural artifacts and established more
than one hundred archives. However, the digital content is disseminated over
different institutions and is only searchable via their respective websites. Thus,
we developed the Union Catalog2 (UC) system as an integrated portal website
to provide users with a convenient means by which to search and browse all the
digital resources at the same time.
To implement such a national and permanent web portal, we consider methods of
data storage in terms of preservation and maintenance. Conventionally, digital
libraries have a close relationship with database systems. In our setting,
however, three factors motivate us to base our storage mechanism on XML
technologies instead of depending on a database system. First, our data
consists mostly of XML-formatted metadata of digital resources. Second, it has
already been demonstrated that XML is a more suitable format for long-term
data preservation [1, 3]. Third, if all the metadata were converted into a
database, subsequent maintenance would incur high costs, especially with
respect to technical training and data conversion.
1 National Digital Archives Program, Taiwan,
http://www.ndap.org.tw/index_en.php
2 Union Catalog of National Digital Archives Program, Taiwan,
http://catalog.digitalarchive.tw/
Metadata can be imported via OAI-PMH [2], XML files, or Microsoft Excel
files, after which UC processes and normalizes the raw metadata before storing
it in separate XML files. The data contained in each metadata record can be
divided into two parts: administrative information, comprising the generation
date, categories, a unique identifier, and a link back to the original
database; and descriptive information, which comprises Dublin Core's 15
elements3. As the data is stored in separate XML files, we do not have to
worry about designing a database schema that anticipates future growth. In
addition, the data can be easily read and edited using basic text editors
like WordPad or Notepad.
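The two-part record layout described above can be sketched as follows. This is
only an illustration, not the UC implementation: the element names (record,
administrative, descriptive), the sample values, and the whitespace-collapsing
normalization step are all assumptions, since the actual UC schema and
normalization rules are not given here. Only the Dublin Core namespace URI is
standard.

```python
import xml.etree.ElementTree as ET

# Standard namespace of the 15 Dublin Core elements.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def normalize(text: str) -> str:
    """Collapse runs of whitespace (an assumed, minimal normalization)."""
    return " ".join(text.split())


def build_record(admin: dict, descriptive: dict) -> ET.Element:
    """Build one metadata record with the two parts described in the text.

    admin       maps a hypothetical field name to a single value.
    descriptive maps a Dublin Core element name to a list of values,
                because Dublin Core elements are repeatable.
    """
    record = ET.Element("record")

    admin_el = ET.SubElement(record, "administrative")
    for name, value in admin.items():
        ET.SubElement(admin_el, name).text = normalize(value)

    desc_el = ET.SubElement(record, "descriptive")
    for name, values in descriptive.items():
        for value in values:
            ET.SubElement(desc_el, f"{{{DC_NS}}}{name}").text = normalize(value)

    return record


# Usage with made-up sample values.
sample = build_record(
    admin={
        "generated": "2008-01-01",
        "category": "Example category",
        "identifier": "example-0001",
        "source_link": "http://example.org/item/1",
    },
    descriptive={"title": ["  An   example title "], "creator": ["Unknown"]},
)
print(ET.tostring(sample, encoding="unicode"))
```

Because each record is an independent, plain-text XML document, it can be
written to its own file and inspected or corrected by hand, which is the
property the storage design relies on.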
Although we believe this is a better data storage solution, separate XML
files make it difficult to handle data queries, data modification, and
category listing efficiently. XML technologies alone cannot overcome these
difficulties, particularly under heavy loads, where UC has to serve numerous
users and manipulate large volumes of data at the same time. To solve these
problems, we adopted Apache Lucene4 as the core search engine of UC and took
advantage of the strengths of a database system. Below we provide detailed
descriptions of the Lucene and database solutions.
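To illustrate why a search index helps when records live in separate files,
the following toy inverted index maps each term to the set of records that
contain it, so a query does not need to open and scan every XML file. It is a
conceptual sketch only: it does not use or reproduce Lucene's API, and the
class name, tokenizer, and sample records are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


class InvertedIndex:
    """Toy term -> record-identifier index (not Lucene)."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, record_id: str, text: str) -> None:
        """Index the text of one record under its identifier."""
        for term in tokenize(text):
            self._postings[term].add(record_id)

    def search(self, query: str) -> set:
        """Return identifiers of records containing every query term."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result


# Usage with made-up records.
index = InvertedIndex()
index.add("example-0001", "Bronze vessel")
index.add("example-0002", "Bronze mirror")
print(sorted(index.search("bronze")))
```

The index is built once when records are imported and consulted at query
time, which keeps the XML files as the authoritative store while queries avoid
parsing them.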
Acknowledgements
This work was supported by the National Science Council of Taiwan under
grant NSC 96-3113-H-001-015.
References
1. Haag, D.: Digital Preservation Testbed White Paper – XML and Digital
Preservation (2002),
http://www.digitaleduurzaamheid.nl/bibliotheek/docs/white-paper_xml-en.pdf
2. Lagoze, C., Van de Sompel, H., Nelson, M., Warner, S.: The Open Archives
Initiative Protocol for Metadata Harvesting, Version 2.0. Open Archives
Initiative (2002),
http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm
3. Müller, E., Klosa, U., Hansson, P., Andersson, S., Siira, E.: Using XML for
Long-term Preservation: Experiences from the DiVA Project. The Sixth
International Symposium on Electronic Theses and Dissertations, the
Humboldt-University in Berlin, Germany (2003),
http://edoc.hu-berlin.de/conferences/etd2003/hansson-peter/HTML/