
Mediator-based Integration of Web Sources (MIWeb) Case Study e-Learning

Dr. Susanne Busse, Thomas Kabisch, Ralf Petzschmann

Forschungsberichte der Fakultät IV – Elektrotechnik und Informatik, No. 2005-02, ISSN 1436-9915

Computation and Information Structures (CIS) Berlin University of Technology

April 2005

Contents
1 Introduction

2 MIWeb @ e-Learning
  2.1 Architecture
    2.1.1 Structural Overview
    2.1.2 Mediator Schema
    2.1.3 Use Cases
  2.2 Technologies
    2.2.1 Common Data Model: RDF
    2.2.2 Component Technologies

3 Wrapping of Web Sources
  3.1 Classification of Information Sources
  3.2 General Web Wrapping Architecture
    3.2.1 Components
    3.2.2 Source Repository
  3.3 Query Tunneling
    3.3.1 Overview
    3.3.2 Introducing Examples
    3.3.3 Query Relaxation
    3.3.4 Result Restriction
  3.4 MIWeb Wrapping Components
    3.4.1 The Google / Roodolf Wrapper
    3.4.2 The Citeseer Wrapper

4 Mapping of RDF Data
  4.1 Model Correspondences
    4.1.1 The Idea of Model Correspondences
    4.1.2 Metamodel
    4.1.3 Model Correspondence for NewEconomy
    4.1.4 Model Correspondence for CiteSeer
  4.2 Interface of the Mapper Component
  4.3 Design of the Mapper Component
    4.3.1 Managing Model Correspondences
    4.3.2 Transforming RDF Data

5 Mediator-based Integration
  5.1 Design of the Mediator Component
  5.2 Managing Query Capabilities
  5.3 Query Processing
    5.3.1 Data Structure for Query Processing
    5.3.2 Query Planning
    5.3.3 Plan Execution and Result Integration

6 Conclusion
  6.1 Experiences
  6.2 Discussion of the MIWeb Architecture

A Model Correspondences
  A.1 XML Schema for Model Correspondences
  A.2 NewEconomy Mediator
  A.3 Citeseer Mediator

Bibliography

List of Figures
1.1 Mediator-based Information Systems (MBIS)
2.1 Architecture of MIWeb@e-Learning
2.2 Overview on the Mediator Schema
2.3 Use Cases
2.4 Query Processing Example 1
2.5 Query Processing Example 2
2.6 Languages in MIWeb@e-Learning
2.7 Component Technologies
3.1 General Wrapping Architecture
3.2 Query Tunneling
3.3 CiteSeer Interface
3.4 CiteSeer Result Snippet
3.5 An Example RDF Result
3.6 Snippet Example from CiteSeer
3.7 Citeseer Grammar
4.1 Structure of Model Correspondences
4.2 Interfaces of the Mapper Component
4.3 Design of the Mapper Component
4.4 Transformation of RDF Data
4.5 Strategy Pattern for Transforming RDF Data
5.1 Design of the Mediator Component
5.2 Public Methods of the QCManager and Mediator Classes
5.3 Class Structure of the qc datastructure Package
5.4 Detailed View of the Prolog Planning Algorithm
5.5 Overview of the Prolog Planning Algorithm

Chapter 1

Introduction
We all know the problems of searching the World Wide Web. Everyone has grappled with the variety of search forms and with unsuitable results from search engines. The reasons for this situation lie in the idea of an open, world-wide net with all its corresponding challenges:

- Web sources are heterogeneous: They are based on different data models, differ in their data structures and semantics, and provide different interfaces.
- Web sources are autonomous: They can change independently of each other and of the systems using them.
- There exists no common ontology that can be used for querying. Therefore, common search engines usually provide simple query schemas with keywords or titles.
- Web data usually comes without any semantic information describing it, so that all interpretation has to be done manually; a computer-based interpretation is impossible.

Tackling these problems requires concepts from different research fields, all gathered within the activities on the semantic web ([URLi, URLj, BLHL01, SM02]). Both database concepts and information retrieval techniques mainly address the search and integration of data with different semantics ([Ull97, WVV+01]).

The transfer of concepts of federated database systems ([SL90, Con97]) to the context of the World Wide Web results in the definition of mediator-based information systems (MBIS). An MBIS (see Figure 1.1) offers a homogeneous, virtual, and read-only access mechanism to a dynamically changing collection of heterogeneous, autonomous, and distributed data sources ([Wie97, BKLW99]). The user does not have to know how to combine data sources and how to integrate the results; the system encapsulates the heterogeneity and provides a flexible search interface using a query language and a schema as a common ontology. The main software components of an MBIS are wrappers, which encapsulate sources and solve technical and data model heterogeneity, and mediators, which integrate data from different sources, resolving logical heterogeneity.

Figure 1.1: Mediator-based Information Systems (MBIS)

The key issue of an MBIS is its mediator schema, as it defines the universe of discourse of the system. Gaps in its definition will result in inappropriate search functionality and unsuitable query results. Therefore, a precise domain-specific definition will bring the best integration results. But one also has to consider the broad range of web sources and the effort of building the MBIS: it is necessary to define a comprehensive mediator schema and to build wrappers integrating web sources that do not provide much semantic information about their data. Therefore, metadata standards are used in the context of the web that are organized in hierarchies: the Dublin Core [URLc, DCMI99] as a top ontology defines a small set of metadata attributes to describe web resources in general; domain-specific standards like the FGDC [FGDC98] or the Learning Object Metadata standard [Com02] provide more precise description capabilities for domain-specific resources. Both the quality of data integration and the effort to integrate a web source correspond to the level of precision chosen for the mediator schema.

Sources that use a schema different from the mediator's are integrated by defining correspondences describing mappings between the schemas ([SP91, Bus02]). In particular, they can be used to integrate sources of related domains that are based on other metadata standards. The explicit specification of correspondences as system metadata allows us to integrate and change sources during the runtime of the system: a prerequisite for the integration of autonomous web sources. The main difficulty is the definition of the mediator schema, as it has to cover all aspects of the whole system.

MIWeb Activity
Our research group has worked on mediator-based information systems for many years. Our work focuses on metadata-based integration using correspondences ([Les00, Bus02, BP01]), on the encapsulation of semi-structured data sources ([Kab03, GKS04]), and on methods for the design and evolution of MBIS ([JKR03, SBKK03, BK04]). The MIWeb activities apply our research results to the specific context of the World Wide Web. Thereby, we consider the following questions (among others):

- How can we provide flexible access to domain-specific information over the web that goes beyond a simple keyword search? How can we use and combine existing metadata standards for this purpose?
- How can we integrate web sources related to the considered domain that provide a smaller query interface, and thereby improve their search interface?
- How can we support the dynamic integration of new sources?
- Are the web technologies, particularly the Resource Description Framework and RDF query languages, sufficient for building mediator-based web information systems?

This case study describes a mediator-based information system about learning material. The first prototype was developed in a students' project in summer 2003 and revised last year. The system provides information for students and teachers about electronically available learning resources and related publications. The mediator schema is based on the (domain-specific) Learning Object Metadata (LOM) standard. We have integrated two domain-specific e-Learning web sources, the scientific citation index Citeseer, and a general web search engine to examine all the problems described above.

The remaining case study is structured as follows: Chapter 2 gives an overview of the architecture and the mediator schema and shows the use of the system. Chapter 3 discusses the wrapping of web sources. In particular, it addresses the challenge of integrating web sources with a wide range of information into a domain-specific mediator. Chapter 4 addresses the integration of data with heterogeneous structures and semantics using correspondences. It proposes a rule-based mapper component transforming RDF data of different schemas. Chapter 5 focuses on the mediator. It discusses the problems of query processing and integration of query results. Chapter 6 summarizes our experiences with applying mediator concepts to the context of the web and identifies challenges for future work.

Chapter 2

MIWeb @ e-Learning
Firstly, we will give an overview of the architecture of the system, the underlying mediator schema, and the processes of querying and managing the system. Secondly, we will introduce the technologies used to build the system, in particular the Resource Description Framework (RDF) and RDQL, a query language for RDF.

2.1 Architecture

2.1.1 Structural Overview

The MIWeb system provides information on e-Learning resources. Consequently, it integrates web sites of this specific domain: NE (NewEconomy) ([URLf, LGH02]) and DBS (Database Systems) are sources containing metadata of learning objects in the information technology field. In addition, MIWeb automatically connects this information to related web documents using more general metadata sources: the scientific citation index Citeseer [URLb] and the search engine Google [URLd]. This way, almost any web document can be searched and interpreted from the e-Learning point of view.

The integration of these sources follows a mediator architecture. It consists of three main components (see Figure 2.1):

Mediator
The mediator offers the access point to the MIWeb system. It provides read access to the integrated sources based on the Learning Object Metadata standard (LOM) [Com02] that is used to describe e-Learning resources. Users can query the system using an SQL-like query language. The mediator is responsible for answering queries against the mediator schema. This includes generating plans for querying the integrated wrappers so that the global query can be answered (query rewriting), executing these queries by communicating with the wrappers, and integrating the results by eliminating redundancies and identifying data conflicts. Query planning is based on the descriptions of the wrapper interfaces, the query capabilities. Therefore, the mediator component also includes a manager for registering, changing, and deleting query capabilities. It is used to dynamically integrate data sources into the system.
Figure 2.1: Architecture of MIWeb@e-Learning

Wrapper
Each source needs to be wrapped in order to make it compatible with the demands of the mediator. Wrapping brings the ability to cope with semantic and syntactic heterogeneity and to transform different protocols into the correct mediator standards (technical heterogeneity). The last two tasks are solved by source-specific wrapper components, the first one by a mapper component which is used by the wrapper. The source-specific wrapper functionality depends on the kind of source that is wrapped: Structured sources (e.g. SQL databases) and semi-structured sources like XML sources can be queried with a higher-order query language. Here the wrapper has to transform queries and results between the mediator language and the language of the specific source. Unstructured sources (e.g. HTML pages) are more difficult to handle: These sources have no schema and usually provide no higher-order query language; they are designed for human interaction. Thus, a wrapper for unstructured sources firstly needs to mine a schema. This is usually done manually by reverse modelling of the source. Secondly, the wrapper is responsible for handling queries in a higher-order query language on this schema. In Chapter 3 we will show how query tunneling can be used for this task.

The domain-specific sources integrated in the MIWeb system are structured sources. The search engine Google can also be handled like a structured source because there exists a QEL/RDF wrapper named Roodolf ([URLh]). Only the CiteSeer wrapper needs to mine a schema. This data extraction task is done with a grammar-based approach, which is suitable for many kinds of unstructured sources.

Mapper
The mapper is responsible for resolving semantic and structural heterogeneity between mediator and wrapper. In our system it is used to transform data of a source-specific schema into LOM-compatible data. As you can see in Figure 2.1, only two of the integrated sources, the DBS and the Google wrapper, do not need any transformations. The transformation is based on correspondences, or mappings. Similar to the query capabilities of the mediator, such mappings are specified explicitly to allow changes and extensions. The mapper component provides an interface for managing the mappings.

Besides the query API, the MIWeb system contains a web-based user interface to demonstrate how user-friendly query forms can be built upon the MIWeb mediator.

A search in the MIWeb system roughly comprises the following steps (see the sketch below): the query entered by the user is passed by the user interface component to the mediator. The mediator determines how to divide the query into sequences (plans) of subqueries to sources (wrappers) using the registered query capabilities. These subqueries are sent to the wrappers. Each wrapper executes its subquery and then transforms the result into a LOM representation using the mapper component. The mediator collects all pieces of information delivered by the wrappers and integrates them into one single result list. This is sent back to the user interface, which displays it in a human-readable form. The following sections will give some examples of this process in more detail.

Summarizing, the most important thing to note about the MIWeb architecture is its metadata-based approach. By explicitly managing query capabilities and mappings, we are able to integrate new sources or change existing ones during the runtime of the system.
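To make this control flow concrete, the following Java sketch shows how the mediator could orchestrate planner, wrappers, and integrator. All names (Wrapper, QueryPlan, QueryPlanner, ResultIntegrator, search) are hypothetical simplifications for illustration, not the actual MIWeb API.

import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified view of the MIWeb search flow.
interface Wrapper {
    // Executes an RDQL subquery and returns LOM-compatible RDF (serialized).
    String query(String rdqlSubquery);
}

class QueryPlan {
    final Wrapper wrapper;
    final String subquery;
    QueryPlan(Wrapper wrapper, String subquery) {
        this.wrapper = wrapper;
        this.subquery = subquery;
    }
}

interface QueryPlanner {
    // Derives executable plans from the registered query capabilities.
    List<QueryPlan> plan(String globalRdqlQuery);
}

interface ResultIntegrator {
    // Merges partial results, eliminating redundancies via fusion attributes.
    String integrate(List<String> lomRdfResults);
}

class Mediator {
    private final QueryPlanner planner;
    private final ResultIntegrator integrator;

    Mediator(QueryPlanner planner, ResultIntegrator integrator) {
        this.planner = planner;
        this.integrator = integrator;
    }

    // Answers a global RDQL query against the LOM mediator schema.
    String search(String globalRdqlQuery) {
        List<QueryPlan> plans = planner.plan(globalRdqlQuery);   // query rewriting
        List<String> partialResults = new ArrayList<String>();
        for (QueryPlan plan : plans) {                           // plan execution
            partialResults.add(plan.wrapper.query(plan.subquery));
        }
        return integrator.integrate(partialResults);             // result integration
    }
}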

2.1.2 Mediator Schema

The mediator of our system uses the Learning Object Metadata Standard (LOM, [Com02]) as its schema.

Learning Object Metadata

Figure 2.2 shows the LOM structure. The main concept is the learning object, or rather its description (metadata).

Besides a general content description of a learning object containing its title, a set of keywords, etc., the metadata set also includes some information specific to e-Learning material. For example, the Educational part of the metadata includes a description of the intended reader of the learning object, the learning time, or the level of interaction between learner and system. Similarly, the LifeCycle part contains attributes for the validation process of learning objects. In addition to the content description, the metadata set contains a technical description as well as conditions and rights that have to be considered when using the learning object (for example costs and copyrights).

Figure 2.2: Overview on the Mediator Schema

The description of a learning object can also include relations to other learning objects. The most popular relations are shown in the diagram. In particular, the hasPart and partOf relationships are used to structure learning material into courses, lectures, and single learning units. The references relationship allows us to define links between learning objects themselves as well as between a learning object and other documents.

Publications in the Mediator Schema

The mediator schema of the MIWeb system comprehends all documents as learning resources that can be described with a learning object metadata set. Thus, all documents can be treated the same way, and it is easy to connect them. But obviously, the automatically extracted description of a publication will contain far fewer attributes than the manually created description of a specific lecture from a learning material collection. Other educational attributes are fixed: for example, there will never be an interaction between reader and publication. Consequently, the MIWeb system will operate on metadata objects with quite different numbers of attributes, depending on the kind of resource.

Defining a mediator schema based on an existing metadata standard nearly always requires some extensions. So in the case of the MIWeb schema: there exists no metadata attribute to specify the number of citations of a publication, an important metadatum for estimating its quality. We use the Annotation facility of the LOM standard to add this information to a metadata description.

Additional Definitions of the Mediator Schema

The schema of a database application usually contains the specification of primary keys for the defined objects. In the context of mediator-based systems, the definition of such keys is much more difficult because they are defined over a set of heterogeneous data sources. But one of the finest features of a mediator is its capability to join and aggregate data objects from different sources. To do this, it needs some attributes that can be used to identify data objects that relate to the same object in the real world. These attributes are a kind of key over the data sources. If their specification is missing, the mediator is only able to collect data from the different sources but cannot integrate it.

To enable data integration, the mediator uses a concept similar to primary keys in central database applications: fusion attributes. They are used to establish connections between data objects from different sources. The integration quality of a mediator directly corresponds to the definition of such fusion attributes. False matches result in wrongly integrated objects on the mediator level; unidentified matches lead to redundant objects. Note that the user only sees the integration result without information on the data sources. Whereas redundancies in a query result are (only) annoying, false matches can lead to genuinely wrong results. Therefore, the specification of fusion attributes should be done with respect to the possible consequences for data integration results.

Within the MIWeb system the following attributes of the schema are characterized as fusion attributes:

- the URI of a learning object
- the title string of a learning object

In our experience, the URI is very useful for finding redundancies of e-Learning resources because they are stored only once. In contrast, there often exist several copies of publications and other documents on the web. Thus, queries on documents in general (which are not the intended use of the MIWeb system!) can deliver redundant query results. The title string is useful for finding more information on documents referenced in e-Learning resources. Short titles may produce wrong matches, but in our tests this did not occur often. In addition, we think that for our purposes wrong matches are not as harmful as in other domains, for example in domains that use such systems for decision-making. A sketch of how fusion attributes might drive result integration follows.
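The following sketch illustrates fusion-attribute-based duplicate detection in Java. The class and method names (LearningObjectMetadata, matches, fuse) are our own illustration; the real MIWeb integration works on RDF models, but the matching logic is analogous.

import java.util.ArrayList;
import java.util.List;

// Minimal metadata record carrying the two MIWeb fusion attributes.
class LearningObjectMetadata {
    String uri;    // fusion attribute 1: may be null for extracted records
    String title;  // fusion attribute 2
    // further LOM attributes omitted

    // Two records describe the same real-world object if a fusion attribute matches.
    boolean matches(LearningObjectMetadata other) {
        if (uri != null && uri.equals(other.uri)) return true;
        return title != null && title.equalsIgnoreCase(other.title);
    }

    // Fuses another record into this one; here we simply fill missing attributes.
    void fuse(LearningObjectMetadata other) {
        if (uri == null) uri = other.uri;
        if (title == null) title = other.title;
    }
}

class FusionIntegrator {
    // Integrates partial results from several wrappers into one duplicate-free list.
    List<LearningObjectMetadata> integrate(List<LearningObjectMetadata> partialResults) {
        List<LearningObjectMetadata> integrated = new ArrayList<LearningObjectMetadata>();
        for (LearningObjectMetadata candidate : partialResults) {
            boolean merged = false;
            for (LearningObjectMetadata existing : integrated) {
                if (existing.matches(candidate)) {
                    existing.fuse(candidate);  // same object: merge, do not duplicate
                    merged = true;
                    break;
                }
            }
            if (!merged) integrated.add(candidate);
        }
        return integrated;
    }
}

Note that a false match here would silently merge two different documents, which is exactly why the choice of fusion attributes has to consider the consequences described above.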

2.1.3 Use Cases

With regard to the MIWeb system we have to consider the main use case of searching as well as some administration use cases (see Figure 2.3):
Figure 2.3: Use Cases

Search
Searching for data about e-Learning resources is the main use case from the user's point of view. The user triggers the system by formulating a query. The system answers the query based on the integrated sources. We will give two examples of this query rewriting and result integration process later. A query result consists of a list of learning object metadata sets that fulfill all conditions defined in the query of the user. But some of the attributes the user asks for may be missing in the result, i.e. we do not require that a metadata set contain all of the result attributes of the query. This differs from the query semantics of RDQL ([URLg]), but we think that it is more appropriate for a mediator-based information system if the user gets all information that is available, even if it is incomplete.

Integration / Deletion of a Data Source
A data source, or rather its wrapper, is integrated into the MIWeb system by registering its query capabilities at the mediator. The data source will then be considered in query processing. Similarly, a data source can be deleted by deleting its query capabilities. If the data source does not support the mediator schema, the wrapper can use the mapper component to do the transformation between the wrapper schema and the mediator schema. In that case, the integration also includes the registration of mapping rules at the mapper.

Configuration of the Mediator
The mediator uses two kinds of metadata: registered query capabilities

enable the mediator to rewrite a given query; specified fusion attributes enable it to connect data sources and to integrate results. Whereas query capabilities are registered and deleted by the wrapper administrators, the fusion attributes are defined on the mediator level. Thus, the administration of the mediator particularly consists of the specification of fusion attributes.1

In the following, we will discuss the Search use case. As the MIWeb system provides a query language, nearly any query on the mediator schema can be formulated. Here are some examples:

- I am looking for e-Learning resources about transactions that contain some exercises.
- Which documents are referenced by a specific learning object (given by its URL), and how often are these documents cited?
- I want to get an overview of transactions. Thus, give me e-Learning resources that I can look through in 30 minutes.
- When searching for e-Learning resources, please consider that my computer cannot handle Flash files.
- I am looking for a lecture on database systems that includes learning units on SQL and relational database design.

To give an impression of the query processing within the MIWeb system, we discuss the first two examples in more detail.

Scenario 1
In the first example we are looking for e-Learning resources about transactions that contain some exercises.

Query. Get all available metadata about e-Learning objects whose subject contains the keyword transaction and that contain at least one exercise.

Query Processing. The mediator determines two simple query plans (see Figure 2.4): one will send the original query to the NE wrapper, the other will send it to the DBS wrapper. The other sources do not contain any specific information about e-Learning objects, so they cannot be used for the selection of exercises. After querying the e-Learning-specific data sources, the mediator integrates the results and finally sends them back to the user. This example shows the integration of results from different sources: redundant data (identified by equal URLs) that is stored both in the DBS source and in the NE source is eliminated in the integration step.
1 We will see in Chapter 5 that the mediator needs additional control elements for its internal query processing. These elements are also managed as metadata.


Figure 2.4: Query Processing Example 1

Scenario 2
In the second example we ask for documents that are referenced by a specific learning object given by its URL. In particular, we are interested in the number of citations of the referenced documents. Going into detail, we could specify the following query.

Query. Give the URL, the title, and the number of citations of learning objects that are referenced by the learning object with the URL http://.../postrelational.html.

Query Processing. The MIWeb system determines the following query plan by analyzing the query capabilities of the registered wrappers (see Figure 2.5): Firstly, the DBS wrapper is used to get the titles of learning objects that are referenced by the given object. The NewEconomy source cannot be used, as it has no information about referenced objects. Secondly, the mediator calls the CiteSeer wrapper and the Google wrapper to get more metadata about the referenced documents, in particular the number of citations. This step is needed because the DBS wrapper only stores the titles of referenced documents. The results from CiteSeer and Google are integrated by the mediator and sent back to the user. This example demonstrates the combination of data sources done by the mediator. As the title is a fusion attribute, the mediator can combine the query capabilities of the DBS wrapper with those of the other sources to get the information queried by the user. A sketch of this two-step plan is given after the figure.

Figure 2.5: Query Processing Example 2
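To illustrate the dependency between the two plan steps, here is a hypothetical Java sketch of how the mediator might execute this plan; the wrapper interface and record type are simplified stand-ins, not the MIWeb API.

import java.util.ArrayList;
import java.util.List;

// Hypothetical document record: URL, title, citation count.
class Doc {
    String url, title;
    int citations;
}

// Hypothetical wrapper interface, reduced to what Scenario 2 needs.
interface SourceWrapper {
    List<String> referencedTitles(String learningObjectUrl); // supported by DBS
    List<Doc> metadataByTitle(String title);                 // CiteSeer, Google
}

class Scenario2Plan {
    // Step 1: get referenced titles from DBS.
    // Step 2: the title is a fusion attribute, so it can be used to ask
    //         CiteSeer and Google for further metadata (citation counts).
    List<Doc> execute(SourceWrapper dbs, SourceWrapper citeseer,
                      SourceWrapper google, String url) {
        List<Doc> result = new ArrayList<Doc>();
        for (String title : dbs.referencedTitles(url)) {
            List<Doc> partial = new ArrayList<Doc>();
            partial.addAll(citeseer.metadataByTitle(title));
            partial.addAll(google.metadataByTitle(title));
            // Records with the same title describe the same document;
            // duplicate elimination via fusion attributes omitted for brevity.
            result.addAll(partial);
        }
        return result;
    }
}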

2.2 Technologies

We discuss the data model representing the LOM data as well as the technologies used for realizing the components of the MIWeb system.

2.2.1 Common Data Model: RDF

Besides the mediator schema, a mediator-based information system defines a data model to represent the schema. It is called the common data model [SCGS91], as it has to be used both by all sources integrated into the system and by the user accessing the system. Heterogeneity in the data models used by sources has to be resolved on the wrapper level. Although the definition of a common data model supported by all wrappers enables the mediator to integrate data without grappling with heterogeneous data models, the mediator component also depends on it: the query language is based on the data model, as are the integration functions realized by the mediator. Thus, from a technical point of view, choosing the data model is one of the most important decisions within the design of a mediator-based information system.

The MIWeb system is based on the Resource Description Framework (RDF) as its common data model and RDQL as its query language. A brief introduction is given below. Figure 2.6 shows the languages that the components use to communicate.

Figure 2.6: Languages in MIWeb@e-Learning

All wrappers have to support RDQL queries and represent their result data in RDF. The CiteSeer source provides its data in HTML format; thus, the CiteSeer wrapper has to transform the HTML into appropriate RDF data according to the mediator schema LOM. The existing Google wrapper Roodolf supports RDF but uses another RDF query language, QEL ([NWQ+02]). Thus, the Google wrapper has to do a query transformation from RDQL to QEL.

The Resource Description Framework RDF

RDF is a W3C standard for describing arbitrary resources ([W3C04a, W3C04d, W3C04b, W3C04c]). RDF uses the XML syntax to specify statements on resources as triples (S, P, O): S is a subject, the resource to be described; P is a predicate that states a property of interest; O stands for an object, which is either a literal or another resource. Resources and properties are identified by a URI. But one can also use anonymous resources if no URI is available (for example for collection nodes).

The following example shows a part of the description of an e-Learning resource on postrelational databases and one referenced paper. The URI is given by the rdf:about attribute. The properties describe the attributes of an e-Learning resource using the LOM standard. We follow the RDF representation of the LOM standard that is described in [NPB03]. It uses top-level ontologies like the Dublin Core where possible to facilitate interoperation with other metadata standards.
<?xml version="1.0" encoding="UTF-8"?>
<!-- Namespaces for the elements according to LOM-RDF -->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:vCard="http://www.w3.org/2001/vcard-rdf/3.0#"
         xmlns:lom-edu="http://ltsc.ieee.org/2002/09/lom-educational#"
         xmlns:lom="http://ltsc.ieee.org/2002/09/lom-base#"
         xmlns:lom-anno="http://ltsc.ieee.org/2002/09/lom-annotation#"
         xmlns:miweb="http://grizzly.cs.tu-berlin.de/mediator/schema#">

  <!-- Example: Description of one e-Learning resource -->
  <rdf:Description rdf:about="http://../postrelational/index.html">
    <dc:title>Postrelationale Datenbanksysteme</dc:title>
    <dc:subject>
      <rdf:Bag>
        <rdf:li>Datenbanksystem</rdf:li>
        <rdf:li>Objektorientiertes Datenbanksystem</rdf:li>
        <rdf:li>Semistrukturierte Daten</rdf:li>
      </rdf:Bag>
    </dc:subject>
    <lom-edu:interactivityLevel
        rdf:resource="http:...lom-educational#VeryLowInteractivity"/>
    <dcterms:references>
      <rdf:Bag>
        <rdf:li>
          <rdf:Description>
            <dc:title>The object-oriented database system manifesto</dc:title>
          </rdf:Description>
        </rdf:li>
        <rdf:li>
          <rdf:Description rdf:about="http://.../semistructured-paper.ps.Z"/>
        </rdf:li>
      </rdf:Bag>
    </dcterms:references>
  </rdf:Description>

  <!-- Example: Description of the referenced document -->
  <rdf:Description rdf:about="http://.../semistructured-paper.ps.Z">
    <dc:creator>
      <lom:Entity><vCard:FN>Buneman</vCard:FN></lom:Entity>
    </dc:creator>
    <dc:title>Semistructured Data</dc:title>
    <lom-anno:annotation>
      <rdf:Bag>
        <rdf:li>
          <lom-anno:Annotation>
            <miweb:citations>167</miweb:citations>
          </lom-anno:Annotation>
        </rdf:li>
      </rdf:Bag>
    </lom-anno:annotation>
    <dcterms:created>
      <dcterms:W3CDTF>
        <rdf:value>1997</rdf:value>
      </dcterms:W3CDTF>
    </dcterms:created>
  </rdf:Description>
</rdf:RDF>
The Query Language RDQL

Several query languages for RDF databases exist; an overview is given in [MKA+02]. We use the RDF Data Query Language (RDQL), which is being developed by the Hewlett Packard Semantic Web Group ([URLg]). It is based on SquishQL and uses SQL-like query constructs. A query specifies triple patterns (WHERE clause) and filter expressions (AND clause). Thereby, variables can be used both on the data level and on the schema level, i.e. for property labels. The following examples show the queries of our scenarios in RDQL.

Scenario 1: Asking for exercises on transactions
SELECT ?url, ?title
WHERE (?url, <dc:title>, ?title),
      (?url, <dc:subjects>, ?subjectBag),
      (?subjectBag, ?foo, ?keyword),
      (?url, <rdf:type>, ?learningResourceType)
AND   ?learningResourceType eq "http:.../lom-educational#Exercise",
      ?keyword eq "transaction"
USING rdf FOR <http://www.w3.org/1999/02/22-rdf-syntax-ns#>,
      dc  FOR <http://purl.org/dc/elements/1.1/>

Scenario 2: Asking for referenced documents of a given learning object

SELECT ?refurl, ?refcreator, ?reftitle, ?refcitations
WHERE (?resource, <dcterms:references>, ?foo1),
      (?foo1, ?foo2, ?refurl),
      (?refurl, <dc:creator>, ?foo3),
      (?foo3, ?foo4, ?refcreator),
      (?refurl, <dc:title>, ?reftitle),
      (?refurl, <lomanno:annotation>, ?foo5),
      (?foo5, ?foo6, ?foo7),
      (?foo7, <miweb:citations>, ?refcitations)
AND   ?resource eq "http://.../postrelational/index.html"
USING dc      FOR <http://purl.org/dc/elements/1.1/>,
      dcterms FOR <http://purl.org/dc/terms/>,
      lomanno FOR <http://ltsc.ieee.org/2002/09/lom-annotation#>,
      miweb   FOR <http://grizzly.cs.tu-berlin.de/mediator/schema#>

Not all queries can be expressed in RDQL. For example, transitive closures cannot be determined. One also has to know the path of properties used for selection; wildcards at schema positions, as in XML languages, are not possible. In addition, there are some limitations in the MIWeb implementation caused by shortcomings of the Jena API ([URLe]) when analyzing a given query:

- The MIWeb system can only process queries that navigate through the database; value-based joins are not possible. Thus, a query like "I'm looking for learning objects about the same topic as the given one" cannot be answered by the MIWeb system, as it requires a join over the topic, which is not a global key attribute.
- We can only process selections that use the equality operator for comparisons.
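For illustration, here is how an RDQL query can be executed against an in-memory RDF model with the Jena API of that era (package com.hp.hpl.jena.rdql in Jena 2). This is a minimal sketch from our best recollection of the Jena 2 API; exact class names may differ between Jena versions, and the data file name is a placeholder.

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdql.Query;
import com.hp.hpl.jena.rdql.QueryEngine;
import com.hp.hpl.jena.rdql.QueryExecution;
import com.hp.hpl.jena.rdql.QueryResults;
import com.hp.hpl.jena.rdql.ResultBinding;

public class RdqlExample {
    public static void main(String[] args) {
        // Load RDF data, e.g. the LOM description shown above.
        Model model = ModelFactory.createDefaultModel();
        model.read("file:lom-example.rdf");

        String rdql =
            "SELECT ?url, ?title " +
            "WHERE (?url, <dc:title>, ?title) " +
            "USING dc FOR <http://purl.org/dc/elements/1.1/>";

        Query query = new Query(rdql);
        query.setSource(model);                  // query the in-memory model
        QueryExecution qe = new QueryEngine(query);

        QueryResults results = qe.exec();
        while (results.hasNext()) {
            ResultBinding binding = (ResultBinding) results.next();
            System.out.println(binding.get("url") + " : " + binding.get("title"));
        }
        results.close();
    }
}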

2.2.2 Component Technologies

The MIWeb system is a J2EE application: each component of the MIWeb architecture (see Figure 2.1) has the structure shown in Figure 2.7.
Figure 2.7: Component Technologies

It consists of a JSP for the remote access to the component using the HTTP protocol, a Java class acting as a facade (see [GHJV95]) to the component's realization, and local classes realizing the component's functionality. All components are implemented in Java; only the mediator additionally uses a Prolog database for query planning. All wrappers of the system have to provide one pre-defined interface that is used by the mediator. The following chapters describe the interfaces and the class structure of the MIWeb components in detail. A sketch of this facade structure is given below.
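The following Java sketch shows how such a component facade might look for a wrapper. The interface and class names (WrapperInterface, CiteSeerWrapperFacade) are hypothetical illustrations; the actual MIWeb interface is described in the following chapters.

// Hypothetical pre-defined wrapper interface used by the mediator.
interface WrapperInterface {
    // Executes an RDQL query and returns the result as RDF/XML.
    String executeQuery(String rdqlQuery);
}

// Facade of one component (cf. [GHJV95]). In MIWeb, a JSP accepts the
// HTTP request, calls this facade, and writes the returned RDF back to
// the caller; local classes realize the actual functionality.
class CiteSeerWrapperFacade implements WrapperInterface {
    public String executeQuery(String rdqlQuery) {
        // Delegate to local classes: query relaxation, source access,
        // result extraction and mapping, result restriction (Chapter 3).
        return queryTunneling(rdqlQuery);
    }

    private String queryTunneling(String rdqlQuery) {
        return "";  // stub; the real steps are discussed in Chapter 3
    }
}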

Chapter 3

Wrapping of Web Sources

For each source, the MIWeb system contains a wrapper component. It translates an RDQL query of the mediator into one or more queries to the specific source, executes them, and summarizes the results into an integrated result according to the LOM mediator schema. Wrapping a web source raises several problems:

1. Unstructured sources like HTML pages do not provide a schema. The wrapper has to mine a schema for the extracted data.
2. The result data is distributed over several web pages. For example, an item list is given as a first query result, which allows one to navigate to more detailed information in a second step. Thus, the wrapper may have to collect the data from different pages.
3. Web sources use their own schemas, different from the mediator schema. In this case, a mapping of query and result is required. We use the mapper component for this task (see Chapter 4).
4. Web sources usually provide a restricted query interface with a small number of queryable attributes. The wrapper has to improve such an interface so that queries in a higher-order query language like RDQL can be answered. If the source provides a different query language, the wrapper has to perform a query mapping.

In the MIWeb system we have to solve the first and the last problem (besides the mapping): The Google wrapper has to translate RDQL queries into another query language, the Datalog-based QEL. The citation index Citeseer is an unstructured source that requires data extraction. In addition, both sources provide only a restricted query interface. Before we describe the design of these wrappers in more detail, we discuss our approach in general and introduce a web wrapper architecture addressing all the problems given above.


3.1 Classification of Information Sources

Information sources may be categorized along many criteria. Here we discuss two of them: the degree of source structuring and the degree of accessibility of sources. In terms of structuring, web sources may be structured, semi-structured, or unstructured. A source is highly accessible if database-like access methods with a higher-order query language exist. Other sources present a restricted interface, e.g. only a restricted query interface; thus, only their interfaces can be wrapped. Web databases are usually available through a (small) interface only, and moreover presented as HTML, so an unstructured form is given.

Structuring. Structured sources may be distinguished more finely by the degree of their structuring (semi-structured vs. fully structured). Both of them offer a metadata annotation of the data structure; thus, a part of the underlying schema is given. Popular examples of semi-structured data sources are XML documents. In this case the main effort of wrapping is to transform the (usually heterogeneous) query languages. Wrapping of structured data sources becomes more sophisticated if the abstraction levels of mediator and sources differ. An example is a mediator on the metadata level integrating sources on the data level. In addition to a language transformation, we then also need a schema and data transformation. Model correspondences may be used for this purpose (cf. [Bus02]). Unstructured sources require a more complex wrapping process. The tasks that need to be supported may be divided into two processes: syntactic and semantic induction of structure, and language transformation. An automated induction of structure may be done with structure-mining techniques.

Accessibility. In contrast to classical database information sources (e.g. relational databases with an SQL query interface), web sources generally allow only restricted access to the underlying data. The database schema is hidden behind a presentation interface. Thus, the bottleneck in terms of accessibility is the restricted query interface. HTML sources offer a form-based query interface, which is semantically restricted to some attributes that may be queried by typing keywords and combining them with usually only one operator. Another restriction is the general lack of a typing system; each parameter is encoded as a string when using HTTP. In database terminology, web source query interfaces usually allow only a flat SELECT with regard to certain attributes. Joins, projections, and other relational operators are not supported. Thus, a query through such an interface is much less expressive than a comparable database query could be. In case the underlying schema is known but not queryable, we propose the Query Tunneling approach to enhance the precision of a query (cf. Section 3.3).

3.2 General Web Wrapping Architecture

Our general understanding of wrapping is that the wrapper offers a query interface that supports a higher-order query language and delivers structured results. As mentioned above, Web sources are neither structured nor queryable in a higher-order query language. In order to bridge this gap, the wrapper needs to undertake a number of transformations of the query and result representations. Consequently, we have structured the general wrapper architecture along these tasks and the meta information used for them. Figure 3.1 draws a general picture of the components a Web wrapper may include. This section gives a short overview of the whole wrapping architecture and summarizes all tasks and transformations that need to be supported by a wrapper.

Figure 3.1: General Wrapping Architecture for Web Sources

3.2.1 Components

Our general architecture supports a distributed handling of different kinds of heterogeneity; thus, the architecture is segmented with respect to these tasks, whereby each component is responsible for one specific transformation task.

Query Relaxation / Result Restriction
This layer addresses the problem of small query interfaces of web sources. It allows the system to transform the given query into a less restricted form that can be executed by the source. Driven by specified relaxation rules, the transformation comprises the substitution or the elimination of selection attributes, respectively. Because this query relaxation leads to a superset of the correct result, a result restriction is done after executing the relaxed query on the source: after the result set has been transformed to an RDF representation, the original query is issued against it so that the result set is reduced to elements that fulfill the original query. We call this approach query tunneling. It is similar to existing techniques of query relaxation in cooperative information systems ([Lee02, Mot90]) that transform a given query to yield a better precision and recall for the user query. In contrast, however, we use this approach to bridge the heterogeneity between the wrapper and source interfaces.

Schema Mapping
The mapping layer is needed if the wrapper has to support a mediator schema that is different from its own schema. It has to consider different types of schema and data conflicts. Thereby, a transformation is needed both for queries and for the result data objects. In the MIWeb system we use the separate mapper component discussed in Chapter 4 for this purpose.1

Parameter / Result Extraction
This component reduces higher-order query statements to a list of query parameters that can be piped to the source. On the way back, this component deals with extraction, which is necessary if the source output is not structured database-like. The output of the result extraction step conforms to the so-called result schema of a source and is well structured (e.g. employing RDF or XML). Most web sources are HTML-based and thus offer no structured output. Result extraction is the focus of numerous related works, e.g. [CMM01], [CHJ02], [Cha01], or [WL03]; [LRNST02] gives a good overview of this issue. We follow a grammar-based paradigm [Kab03]. In the MIWeb system we defined the grammar manually. In the future we want to generate it automatically from the model, similar to [CMM01] or [NSS02].

Query Serialization / Result Integration
In some cases a query needs to be divided into multiple source queries in order to get all results, for example if

- the source is restricted in its operational richness, so that complex queries like nested queries have to be substituted by multiple simple queries which are sent separately to the source,
- the source delivers a restricted number of result sets, so that more than one query is necessary to get all query answers, or
- the source data is distributed over several web pages that have to be parsed to collect all relevant data.

The ASByE2 approach introduced in [GLdSRN00] describes the idea of a tool to design a wrapper for web sites consisting of several pages. The source description contains important information for query serialization. It specifies whether queries need to be serialized because of result set restrictions or restricted capabilities of query attributes. To avoid partial result sets, additional information about the behavior of the source is stored: either it delivers all results at once, or it cuts off after a predefined number of results.

1 Note that in our prototype the mapper component only transforms RDF data. The transformation of query elements is done implicitly in the query relaxation process.
2 ASByE = Agent Specification by Example

3.2.2 Source Repository

Generally, the Source Repository contains the meta information for the configuration of the wrapper. Each wrapping component uses the Source Repository for its own tasks. Thus it contains four sections: Relaxation Rules, Mapping Rules, Extraction Grammars, and Serialization Information.

Relaxation Rules
The task of the corresponding relaxation component is to ensure a fully queryable interface to the source, which is based on the source result schema R. Thus, relaxation rules are formulated to answer the following questions: Which attributes of R are part of the interface schema I? Which attributes of R are queryable and which are not?

Mapping Rules
Whereas relaxation rules are based on the source result schema R, additional mapping rules have to be applied if the desired wrapper export schema W is different from the result schema R. This is mostly the case if the integration into a mediator-based information system with a common schema is targeted. In this case, schema mapping rules are formulated which resolve schema heterogeneity between R and W. In our MIWeb system we use the mapper component for this purpose.

Extraction Grammars
The repository entry for extraction grammars needs to be divided into parameter extraction grammars, used for parameter extraction, and result extraction grammars. The first are used to extract queryable fields from the relaxed and serialized query expression. An important point in these grammars is the aggregation of multiple query conditions for one interface field into one expression. The result extraction grammar needs to mine the result schema from the unstructured result document that is delivered by an HTML source. Some approaches use regular expressions for this task; others derive their own wrapping grammars [Cha01], [NSS02], [GKS04], [CMM01].

Serialization Information
Under this repository segment we sum up the information required if queries need to be serialized because of result set restrictions or restricted capabilities of interface fields. Thus, for each attribute attr_I of the interface schema there has to be information on whether complex query conditions are allowed for this field or not. To avoid partial result sets caused by source limitations, additional information has to be provided about the behavior of the source: either it delivers all results at once, or it cuts off after a predefined number of results. In that case, information is stored on how to split a query into partial queries which deliver all valid results. This part is the subject of future research. A sketch of how relaxation rules could be represented follows.
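To make the repository content concrete, here is a minimal Java sketch of how relaxation rules might be represented. The class names are our own illustration; the MIWeb prototype stores this information as metadata in the Source Repository.

import java.util.ArrayList;
import java.util.List;

// A relaxation rule for one attribute of the source result schema R.
abstract class RelaxationRule {
    final String resultAttribute;  // attr_R
    RelaxationRule(String resultAttribute) {
        this.resultAttribute = resultAttribute;
    }
}

// attr_R -> attr_I: attr_R can be queried implicitly via an interface attribute.
class SubstitutionRule extends RelaxationRule {
    final String interfaceAttribute;  // attr_I, e.g. "keyword"
    SubstitutionRule(String attrR, String attrI) {
        super(attrR);
        this.interfaceAttribute = attrI;
    }
}

// attr_R is non-queryable and must be dropped from the query;
// the dropped condition is re-applied during result restriction.
class EliminationRule extends RelaxationRule {
    EliminationRule(String attrR) { super(attrR); }
}

// The CiteSeer relaxation rules discussed in Section 3.3.3, as objects.
class CiteSeerRules {
    static List<RelaxationRule> rules() {
        List<RelaxationRule> rules = new ArrayList<RelaxationRule>();
        rules.add(new SubstitutionRule("title", "keyword"));
        rules.add(new SubstitutionRule("author", "keyword"));
        rules.add(new SubstitutionRule("year", "keyword"));
        rules.add(new EliminationRule("link"));
        rules.add(new EliminationRule("citations"));
        return rules;
    }
}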


3.3 Query Tunneling

We will give a short overview of our query tunneling approach in the MIWeb system before we discuss the challenges in more detail.

3.3.1 Overview

The query tunneling approach allows us to integrate web sources with restrictive query interfaces into a mediator-based information system that uses a higher-order query language like RDQL or XQuery. Google and Citeseer are good examples of such sources.

Our starting point are techniques of query relaxation investigated both in the information retrieval and in the database area ([Mot90, CC94, BYRN99]). In the context of cooperative information systems, query relaxation is used to relax a user query into a less restricted form to permit approximate answers. We follow the definition in [Lee02] of query relaxation as a transformation of a query "so that a greater number or bigger-scoped answers are returned" (p. 26).

When using query relaxation for wrapping a restrictive web source interface, we have to consider the elements (attributes) that can be used in queries or that are part of the result, respectively. A result attribute can also be queryable if there exists an attribute usable in queries that queries the result attribute implicitly because of an existing index. A source that offers a keyword search is a good example of this situation. Following this idea, we use two kinds of relaxation rules:

- Attribute substitution rules that substitute a queryable attribute by an attribute that enables the selection (explicitly or implicitly).
- Attribute elimination rules that specify non-queryable attributes, i.e. attributes that can neither be queried explicitly nor implicitly. The number of citations in Citeseer is an example of such a non-queryable attribute.

Using these rules we can translate a query into one that can be executed by the source. As the MIWeb system only allows the equality operator for selections, we do not discuss range queries here. In the MIWeb system we substitute all selection attributes and use them for a keyword search. Non-queryable attributes are not provided at the wrapper interface, so they do not have to be eliminated. The HTML result is transformed to RDF by a grammar-based parser.

But we know that the query relaxation leads to a superset of answers to the original query. Therefore we need a result restriction after processing the relaxed query: the original query is executed again against the RDF result. This eliminates data that does not fulfill the given search criteria.

We call our approach of query relaxation and result restriction query tunneling. An overview of the process is depicted in Figure 3.2. As you can see, the data has to be transformed before the result restriction, because the original query is defined against the LOM schema. In the MIWeb system the mapper is responsible for mapping the RDF representation into a LOM-compatible RDF representation. Finally, the result of executing the original query is given back to the client (in our example the mediator). A sketch of the whole pipeline is given below.

Figure 3.2: Query Tunneling
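The following Java sketch traces the query tunneling sequence of Figure 3.2 for one wrapper call. All method names are hypothetical stand-ins for the MIWeb components (transformator, schema mapper, query extractor, result restrictor); assume executeRdql runs an RDQL query over a serialized RDF model, e.g. via the Jena API.

// Hypothetical query tunneling pipeline (cf. Figure 3.2).
// Each step corresponds to one component of the wrapping architecture.
public class QueryTunnelingWrapper {

    public String query(String originalRdql) {
        // 1. Query relaxation: rewrite selections via substitution rules,
        //    drop non-queryable attributes via elimination rules.
        String relaxedRdql = relax(originalRdql);

        // 2. Parameter extraction + source access: reduce the relaxed query
        //    to a parameter list and issue an HTTP request against the source.
        String html = httpRequest(extractParameters(relaxedRdql));

        // 3. Result extraction: parse the HTML into RDF of the local
        //    source schema (grammar-based in MIWeb).
        String localRdf = extractToRdf(html);

        // 4. Schema mapping: transform the local RDF into LOM-compatible RDF
        //    (done by the mapper component in MIWeb).
        String globalRdf = mapToLom(localRdf);

        // 5. Result restriction: run the ORIGINAL query against the mapped
        //    result, removing the superfluous records the relaxation let in.
        return executeRdql(originalRdql, globalRdf);
    }

    // Stubs standing in for the real components:
    private String relax(String rdql) { return rdql; }
    private String extractParameters(String rdql) { return rdql; }
    private String httpRequest(String params) { return ""; }
    private String extractToRdf(String html) { return ""; }
    private String mapToLom(String rdf) { return rdf; }
    private String executeRdql(String rdql, String rdf) { return rdf; }
}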

3.3.2 Introducing Examples

We use the scientific publication source CiteSeer (cf. [URLb]) for our example queries. This data source supports only keyword queries.

Figure 3.3: CiteSeer Interface

CiteSeer offers an interface schema I_CiteSeer that only allows keyword retrieval: I_CiteSeer = (keyword). The result schema R_CiteSeer is more sophisticated; we discuss the overview page here, whose schema may be denoted as R_CiteSeer = (title, author, year, link, citations). Thus, the only query capability of this source is keyword → (title, author, year, link, citations). Higher-order queries are generally not supported.

Figure 3.4: CiteSeer Result Snippet

Based on four different query examples we will discuss our approach.

Example 1: Simple Query
Return all papers of the author "Garcia-Molina".
This is an example of a simple query that cannot be issued against the interface directly. The attribute author is not a valid query attribute, but its value can serve as a selection criterion for a keyword query.

Example 2: Non-Queryable Attributes
Return all papers of the author "Garcia-Molina" which have more than 100 citations.
This query is challenging because the number of citations is an element of the result schema but is not queryable: the attribute is not indexed in the underlying data source.

Example 3: Range Query
Return all papers of the author "Garcia-Molina" which have been written between 1998 and 2004.
Most sources support only exact matches; thus a range query cannot be issued. Even if this particular query could be rewritten to a set of exact queries, this is not a suitable approach in general.

A complex query containing selection criteria for queryable attributes combined with Boolean operators (e.g. AND, OR, and brackets) can be issued directly if the query interface supports the respective operators. Otherwise, if the source is not capable of such complex queries, an adequate handling is needed in order to split the query into its atoms. This is called query serialization. When the results of the different queries are received, they are integrated.

Example 4: Complex Query
Return all publications written by "Garcia-Molina" or with a title containing "federation".
If the source does not support disjunctive (OR) queries, the query has to be split into two queries: Return all publications written by "Garcia-Molina" and Return all publications with a title containing "federation". Then, the union of the results forms the result set of the original query.

3.3.3 Query Relaxation

Relaxation Rules. We distinguish two kinds of relaxation rules: Attribute Substitution Rules and Attribute Elimination Rules.

Attribute Substitution Rules. An attribute substitution rule is formulated if an attribute attrR of the source result schema is not provided in the interface schema I, but there exists an attribute attrI of the interface schema that can be used to query attrR implicitly. In this case we call attrR a Queryable Attribute, because the extensions of attrR are indexed and can be queried. An attribute substitution rule is denoted as attrR → attrI. The most common use case for an attribute substitution rule is a form-based data source that offers only a keyword search.

Attribute Elimination Rules. In contrast, an elimination rule occurs if an attribute attrR of the result schema does not exist in the interface schema I and, additionally, there is no suitable attribute of the interface schema that allows an implicit querying of attrR. In that case no useful substitution is possible; moreover, a substitution would produce incorrect query relaxations. We call this kind of attribute a Non-queryable Attribute, meaning that the interface of the source does not provide any query opportunity for it. A relaxation rule for a non-queryable attribute attrR is written attrR → ∅, meaning that this attribute has to be eliminated from the query before the query is issued against the source. Note that Query Tunneling provides an added value, because non-queryable attributes become queryable at the wrapper export interface, as we will elaborate in Section 3.3.4.

While the query interface of CiteSeer provides only a keyword search, our wrapper is able to query the attributes of the result schema, too. It provides the result schema RCiteSeer. Thus, the wrapper provides the query capability (title, author, year, link, citations) → (title, author, year, link, citations). Internally, each of the left-hand attributes has to be either mapped to the interface schema of CiteSeer by attribute substitution (i.e. to the keyword) or eliminated from the query: title → keyword, author → keyword, year → keyword, link → ∅, and citations → ∅.

Query Rewriting Using Relaxation Rules. Starting from the original RDQL query, the wrapper relaxes it according to the source description. For each selection attribute attrR of the query it has to be checked whether it is queryable, i.e. whether a substitution rule exists. In this case, the selection condition is rewritten accordingly. Otherwise an elimination rule exists and the attribute is removed from the query; this selection attribute attrR has to be applied later, in the result restriction phase. We discuss the query relaxation along the examples introduced in the last section.

Query with Queryable Attributes Only (cf. Examples 1 and 4). The simple query of Example 1 can be written in RDQL as

(Q1) SELECT *
     WHERE (?resource, <cs:author>, ?author)
     AND   ?author =~ "Garcia-Molina"
     USING cs FOR <http://cis.cs.tu-berlin.de/citeseer-rdf/>

Using the substitution rule author → keyword it is simply relaxed to

(Q2) SELECT *
     WHERE (?resource, <cs:keyword>, ?keyword)
     AND   ?keyword =~ "Garcia-Molina"
     USING cs FOR <http://cis.cs.tu-berlin.de/citeseer-rdf/>

The complex query given in Example 4 is represented in RDQL as:

(Q3) SELECT *
     WHERE (?resource, <cs:author>, ?author)
           (?resource, <cs:title>, ?title)
     AND   (?author =~ "Garcia-Molina" || ?title =~ /Federation/)
     USING cs FOR <http://cis.cs.tu-berlin.de/citeseer-rdf/>

Since both author and title are queryable via the keyword field, it is relaxed to

(Q4) SELECT *
     WHERE (?resource, <cs:keyword>, ?keyword)
     AND   (?keyword =~ "Garcia-Molina" || ?keyword =~ "Federation")
     USING cs FOR <http://cis.cs.tu-berlin.de/citeseer-rdf/>

Query Containing Non-Queryable Attributes (cf. Example 2). Elimination rules have to be applied for non-queryable attributes. Starting from the following RDQL query

(Q5) SELECT *
     WHERE (?resource, <cs:author>, ?author),
           (?resource, <cs:citations>, ?citations)
     AND   (?author =~ "Garcia-Molina" && ?citations >= 100)
     USING cs FOR <http://cis.cs.tu-berlin.de/citeseer-rdf/>

the elimination rule citations → ∅ is applied: the citations attribute cannot be queried through the query interface. Eventually it is also relaxed to the query (Q2) using a substitution rule.

Range Query (cf. Example 3). If range queries cannot be applied through a query interface, the respective selection predicates have to be eliminated from the query like non-queryable attributes and applied to the result thereafter. Note that the keyword query (Q2) is also yielded if we apply a range query to the equality-queryable attribute year as follows:

(Q6) SELECT *
     WHERE (?resource, <cs:author>, ?author),
           (?resource, <cs:year>, ?year)
     AND   (?author =~ "Garcia-Molina" && ?year >= 1998 && ?year <= 2004)
     USING cs FOR <http://cis.cs.tu-berlin.de/citeseer-rdf/>

Alternatively, one could relax this query with several disjunctive selection conditions:

(Q7) SELECT *
     WHERE (?resource, <cs:keyword>, ?keywordAuthor),
           (?resource, <cs:keyword>, ?keyword)
     AND   (?keywordAuthor =~ "Garcia-Molina" &&
            (?keyword =~ "1998" || ?keyword =~ "1999" || ?keyword =~ "2000" ||
             ?keyword =~ "2001" || ?keyword =~ "2002" || ?keyword =~ "2003" ||
             ?keyword =~ "2004"))
     USING cs FOR <http://cis.cs.tu-berlin.de/citeseer-rdf/>

But in general, range queries cannot be rewritten to equality-based queries, e.g. for the condition ?year <= 2000 or for continuous range intervals.
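To make the rule application concrete, the following is a minimal sketch of the relaxation step in Java. The class and method names are illustrative, not the actual MIWeb wrapper classes; it assumes that selection conditions have already been reduced to attribute/value pairs.

import java.util.*;

/** Illustrative sketch: applies attribute substitution and elimination
 *  rules to a list of selection conditions. */
public class QueryRelaxer {

    // substitution rules attrR -> attrI; attributes without an entry are
    // non-queryable and fall under an elimination rule attrR -> (empty)
    private static final Map<String, String> RULES = new HashMap<String, String>();
    static {
        RULES.put("title",  "keyword");
        RULES.put("author", "keyword");
        RULES.put("year",   "keyword");
        // "link", "citations": no entry -> eliminated before querying the source
    }

    /** conditions and postponed are lists of {attribute, value} pairs. */
    public static List<String[]> relax(List<String[]> conditions,
                                       List<String[]> postponed) {
        List<String[]> relaxed = new ArrayList<String[]>();
        for (String[] cond : conditions) {
            String substitute = RULES.get(cond[0]);
            if (substitute != null) {
                relaxed.add(new String[] { substitute, cond[1] }); // substitution rule
            } else {
                postponed.add(cond); // elimination rule: re-applied later
            }
        }
        return relaxed;
    }
}

The conditions collected in postponed are exactly those that the result restriction phase of Section 3.3.4 has to re-apply.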

3.3.4

Result Restriction

Since we pose relaxed queries against the source, the result set might contain superfluous records. For instance, if a full-text index is accessed for the keyword search (as for CiteSeer), a result might contain a selection criterion anywhere in the respective document and not necessarily in the author or title attribute. Moreover, given the relaxed range query (Q7) above, several records in the result set might contain the respective year only in their references section without having been published between 1998 and 2004. Thus the results have to be filtered with respect to the previously relaxed selection criteria. In order to obtain the results fulfilling all the selection criteria stated in the original RDQL query, we execute the original query against the intermediate RDF result set.

We give an example. For the RDQL query (Q5) we issue the relaxed query (Q2) against CiteSeer, whereby only the author name can be taken for the keyword search. Consequently, the results of this query have to be post-processed as follows: only results that are reported as cited at least a hundred times are kept, and the string Garcia-Molina has to be part of the reported authors, not only contained elsewhere. All other results are removed.

Thanks to our architecture, the intermediate result is represented in RDF, so that we can execute RDQL queries on it by means of the Jena API. The execution of the original RDQL query against the result of the relaxed query delivers the correct result. For instance, given the three results displayed in Figure 3.5, only the first result fulfills the original query and will be returned by the wrapper. In detail, for the second result the string Garcia-Molina is not contained in the <cs:author> element, although the paper is cited more than a hundred times.


<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:cs="http://www.cis.cs.tu-berlin.de/citeseer-rdf#">
  <rdf:Description rdf:about=
      "http://citeseer.ist.psu.edu/papakonstantinou95object.html">
    <cs:title>Object Exchange Across Heterogeneous
              Information Sources</cs:title>
    <cs:author> ... Hector Garcia-Molina ... </cs:author>
    <cs:year>1995</cs:year>
    <cs:link>www-db.stanford.edu/pub/papers/icde95.ps</cs:link>
    <cs:citations>243</cs:citations>
  </rdf:Description>
  <rdf:Description rdf:about="...">
    <cs:title>Zebra: A Striped Network File System</cs:title>
    <cs:author>John H. Hartman, John K. Ousterhout</cs:author>
    <cs:year>1993</cs:year>
    <cs:link>...</cs:link>
    <cs:citations>157</cs:citations>
  </rdf:Description>
  <rdf:Description rdf:about="...">
    <cs:title>Managing Semantic ... </cs:title>
    <cs:author>Richard Hull</cs:author>
    <cs:year>1997</cs:year>
    <cs:link>...</cs:link>
    <cs:citations>88</cs:citations>
  </rdf:Description>
</rdf:RDF>

Figure 3.5: An Example RDF Result

In the third result neither of the two selection criteria is fulfilled: it is cited less than a hundred times and the <cs:author> element does not match the criterion. In fact, publications of Garcia-Molina are referenced in both of these articles, which is why they appear among the results of the keyword search for "Garcia-Molina".
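The post-processing step can be sketched as follows, assuming the RDQL classes shipped with the Jena 2 releases of that time (package com.hp.hpl.jena.rdql); exact class and method names may differ between Jena versions.

import java.io.StringReader;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdql.*;

public class ResultRestrictor {

    /** Re-executes the original (unrelaxed) RDQL query against the
     *  intermediate RDF result obtained for the relaxed query. */
    public static QueryResults restrict(String intermediateRdfXml,
                                        String originalRdql) {
        Model intermediate = ModelFactory.createDefaultModel();
        intermediate.read(new StringReader(intermediateRdfXml), null);

        Query query = new Query(originalRdql);
        query.setSource(intermediate);          // query the intermediate model
        return new QueryEngine(query).exec();   // only fully matching records remain
        // the caller is responsible for closing the returned QueryResults
    }
}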

3.4

MIWeb Wrapping Components

The MIWeb system integrates four autonomous information sources. Two of them (CiteSeer [URLb] and Roodolf [URLh], the RDF interface of Google) are integrated over the World Wide Web. The other sources are local RDF databases describing learning objects. Each of these sources poses different wrapping challenges: whereas wrapping the web interface of CiteSeer is mainly a classical web-content extraction task, the Roodolf wrapper mainly performs language transformation. The e-Learning sources are well structured and provide an RDQL/RDF interface, so that we only need to resolve some heterogeneity, which is completely covered by the mapping component (see Chapter 4). Therefore, this section focuses on the implementation of the two former wrappers.


3.4.1

The Google / Roodolf Wrapper

The wrapper for the Google API [URLd] is based on Roodolf [URLh], an RDF interface that allows one to query the Google API using a Datalog subset. Thus, the Google wrapper transforms RDQL queries into Datalog queries. The result of the underlying source is an RDF document that is returned to the mediator. The wrapper implements the query tunneling approach in order to cope with the restricted query interface of the Google API. The query language of Roodolf is a subset of Datalog, restricted to plain or conjunctive queries. In the case of more complex (i.e. disjunctive) queries, the Google wrapper serializes the query into a number of plain or conjunctive queries suitable for the underlying data source, as sketched below. Each generated subquery is issued against the underlying source, and the result documents are integrated into one aggregated result.

The query interface of Roodolf is based on 12 query terms which are offered by the Google API. In the context of the MIWeb system mainly the following four query terms are important:

Titlephrase: restricts the query to pages whose title contains the entered search terms in the given order.

Phrase: similar to Titlephrase, but extended to a full-text search over the whole document.

Allintext: searches for pages that contain the search terms in their body.

FileType: restricts the search to documents of a certain file type.

Thus, the wrapper provides mappings from the corresponding RDF properties Title, Keywords, Description and Format to the given query terms. Finally, we give an example. Suppose a query with the keywords RDQL and Mapping. The RDQL query could look like this:

SELECT ...
WHERE (?resource, <http://purl.org/dc/elements/1.1/keywords>, ?keyword1),
      (?resource, <http://purl.org/dc/elements/1.1/keywords>, ?keyword2)
AND   ?keyword1 =~ "RDQL" AND ?keyword2 =~ "Mapping"

The corresponding Datalog query is:

:- phrase(X, "RDQL"), phrase(X, "Mapping");
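The serialization loop itself can be sketched as follows; the class and method names are illustrative, and the transport of the Datalog queries to Roodolf is omitted.

import java.util.List;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;

public abstract class SerializingWrapper {

    /** Issues one plain or conjunctive Datalog query against Roodolf and
     *  returns the RDF answer; the HTTP transport is omitted here. */
    protected abstract Model queryRoodolf(String conjunctiveDatalogQuery);

    /** Executes each serialized subquery and integrates the answers. */
    public Model execute(List<String> conjunctiveSubqueries) {
        Model aggregated = ModelFactory.createDefaultModel();
        for (String subquery : conjunctiveSubqueries) {
            aggregated = aggregated.union(queryRoodolf(subquery));
        }
        return aggregated;
    }
}

For the disjunctive Example 4, the list would contain one conjunctive Datalog query per disjunct, and the RDF union realizes the set union of the result documents.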

3.4.2

The Citeseer Wrapper

CiteSeer [URLb] is a web source that describes publications in the domain of computer science, providing information in varying degrees of detail for each publication. CiteSeer is only accessible through a classical HTML-based interface. The wrapper therefore needs to extract query keywords and result information from the unstructured HTML response.

The first implementation of the CiteSeer wrapper is restricted to the first overview page returned. The wrapping grammar is based on the textual information that is visible to the user; the HTML structure is not in the focus of investigation. On that overview page a so-called snippet is given for each publication. Figure 3.6 shows an example of such a snippet.

Figure 3.6: Snippet Example from CiteSeer

The information of a snippet mainly consists of the following parts:

Title of the Resource
Year of Publication
List of Authors
Cut-Out of Text
Number of Citations

The task of the CiteSeer wrapper is to extract this information and to label it with the given tags. The wrapper uses a grammar-based approach to parse the structure. Figure 3.7 gives an overview of the grammar of such a snippet.

Helpers
  ascii        = [ 0 .. 255 ];
  ascii_title  = [ascii - '-'];
  ascii_author = [ascii - ['(' + ',']];

States
  title, author;

Tokens
  {title}           title      = ascii_title+;
  {title -> author} dash       = '-';
  {author}          authorname = ascii_author+;
                    comma      = ',';

Productions
  snippet = title dash authors;
  authors = {singleauthor} authorname |
            {multauthor}   authorname comma authors;

Figure 3.7: Cut-out of the Citeseer Grammar

The syntax of this grammar follows the compiler compiler SableCC that has been used for the development of the wrapper classes. Thus, the grammar is state-oriented. Tokens denote the terminals; helpers are auxiliary character classes used in their definitions. Having formulated the grammar, a specific wrapper class can be compiled with respect to the input grammar.
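A wrapper class generated by SableCC is typically driven as sketched below. The package and node-class names depend on the grammar file and are illustrative here, not the actual MIWeb classes.

import java.io.PushbackReader;
import java.io.StringReader;

public class SnippetExtractor extends analysis.DepthFirstAdapter {

    // visitor callbacks for the tokens defined in the grammar above
    public void caseTTitle(node.TTitle token) {
        System.out.println("title:  " + token.getText());
    }

    public void caseTAuthorname(node.TAuthorname token) {
        System.out.println("author: " + token.getText());
    }

    public static void main(String[] args) throws Exception {
        String snippet = "Object Exchange Across ... - Papakonstantinou, Garcia-Molina";
        parser.Parser p = new parser.Parser(
            new lexer.Lexer(new PushbackReader(new StringReader(snippet), 1024)));
        p.parse().apply(new SnippetExtractor());  // walk the AST, emit labeled parts
    }
}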

Chapter 4

Mapping of RDF Data


The mapper component allows one to transform RDF data. The transformation is specified by a mapping rule. We call it a correspondence to emphasize that it can be used for transformations in both directions. The correspondences are specified explicitly and stored by the mapper component. In this way, new correspondences can be added dynamically and existing ones can be changed. The following sections describe the interface and the design of the mapper component after discussing correspondences in more detail.

4.1

Model Correspondences

We first give an overview of model correspondences before we show how model correspondences can be defined in the MIWeb system and how they are managed by the mapper component.

4.1.1

The Idea of Model Correspondences

Coping with heterogeneous data is one of the main tasks in a federated information system because the participants have developed their systems and data schemas independently from each other. Thereby heterogeneity means ([BKLW99]):

data model heterogeneity, capturing the fact that the data is represented using different data models;

semantical heterogeneity, concerning the semantics of data and schema (even in a common data model). It primarily addresses the naming problem: equal names may denote different concepts (homonyms), different names may denote the same concept (synonyms);

schematical heterogeneity, relating to the different abstraction levels (data vs. schema level) used for modelling concepts. In the relational data model, three types of such conflicts exist ([Mil98]): relation vs. attribute name, attribute name vs. attribute value, and relation vs. attribute value. Semistructured data contains similar conflicts like tag name vs. element value;

structural heterogeneity, existing if elements that have the same meaning are structured in different ways, e.g. if elements are grouped into different entities.

Data model heterogeneity is solved by the wrapper components: in the MIWeb system they translate the data into the common data model RDF. Whereas such a translation only depends on the data model concepts, it is a real challenge to identify semantic relationships between the data contents: it is impossible to define the data semantics completely in a schema. Therefore, we usually need a domain expert who specifies these semantic relationships, called model correspondences. In the last decade the problem of (semi-)automatic correspondence discovery by schema matching or service matchmaking has moved into the center of research. We assume a manual specification for our purposes. A good overview of schema matching approaches is given in [BP01].

Research on explicitly specifying correspondences started in the early 1990s. The focus was on the definition of languages for correspondences and their use in different contexts like federated information systems ([PAGM96, LRO96]), schema integration ([Sch98]), data warehouses ([CGL+99]), and schema mapping ([ISO04, W3C99]). Briefly, a model correspondence is a set of correspondences that specifies data elements of two schemas that represent the same information. The correspondences can be specified either as a relation that is interpreted for data transformation or as a (uni-directional) mapping that can be executed for transformation. Queries like in Clio ([PVM+02]) or an XSLT rule are good examples for such a mapping definition. We propose the specification of correspondences as relations for two reasons: firstly, the specification of the domain expert can be used for data transformation in both directions; secondly, we can use the specification for the transformation of data as well as for the transformation of queries.

To clarify the terminology, we distinguish the terms correspondence and mapping:

Correspondence: the corresponding elements provide (at least partially) the same information.

Mapping: a mapping is a computable procedure that transforms values according to a correspondence.

Following our approach we need a (meta)model that allows us to specify model correspondences and algorithms that perform the data transformation based on these correspondences. Thereby, two implementations are possible: we could determine mappings from the correspondence definitions expressed in an existing

mapping language like XSLT, or we could implement the algorithm for data transformation ourselves. At first glance, XSLT seems suitable for our purposes because RDF uses an XML representation. But the mapping would then depend on the chosen representation (there exist several XML representations for one RDF model!). Therefore we decided to implement new data transformation algorithms operating directly on the RDF models.

4.1.2

Metamodel

We define a metamodel for model correspondences to allow their explicit specification (see Figure 4.1). Note that it is defined according to the requirements of the MIWeb system and should be seen as a starting point for such a mapper component. Although we have already used this metamodel for another application domain without any problems, it does not claim completeness.
[Figure 4.1 shows the metamodel as a class diagram: a ModelCorrespondence (LeftSchema, RightSchema) contains one or more Correspondences (Name). A Correspondence is either a SimpleCorrespondence (LeftPath, RightPath) with the subtypes SameAs, ElementCorrespondence, and ValueCorrespondence (LeftValue, RightValue), or a ComplexCorrespondence (SequentialCorrespondence, AlternativeCorrespondence) aggregating two or more correspondence steps.]

Figure 4.1: Structure of Model Correspondences

A model correspondence relates two schemas that are identified by their names. The model correspondence contains a set of correspondences. We distinguish simple and complex correspondences.

A simple correspondence relates two single schema elements that are identified by their property path starting at the resource they belong to. Thereby we distinguish three types: both the property paths and their values can be equal (SameAs); the property paths can be different (ElementCorrespondence); or property paths and values can be different (ValueCorrespondence, with the additional specification of the corresponding values).

A complex correspondence aggregates a set of correspondences that are needed to get one schema element of one of the related schemas. The included correspondences can be

alternatives that can be used to get the target element (and its values). Then, the related schema represents the same thing in different ways.

a sequence of mapping steps that have to be executed step by step to get the target element from the source schema.

Obviously, an alternative correspondence can only be used in one direction. But one can define one alternative as a default mapping that is used for the transformation to the schema with more alternatives. The default mapping is specified outside the alternative correspondence. If no default mapping is specified, an alternative correspondence will be ignored.

The metamodel is specified by an XML schema (see Appendix A). It contains constraints like cardinalities and required attributes. In addition, the following constraints are defined on a set of model correspondences:

Between two schemas (identified by their names) at most one model correspondence may exist.
The names of corresponding schemas are different.
A schema element path can occur in only one same as or element correspondence.
A schema element path with a specific value can occur in only one value correspondence.
A schema element path of a value correspondence may only occur in other value correspondences, but not in other kinds of correspondences.
The schema element paths of a same as correspondence are equal.
All correspondences of an alternative correspondence have the same left side or the same right side. The other side is always different from the others.

4.1.3

Model Correspondence for NewEconomy

Although both the mediator and the New Economy source are based on the LOM metadata schema, they use different RDF representations. Thus, a mapping is needed. The following table shows the corresponding structures and the kind of correspondence.

1.2 General.Title (same as correspondence)
  New Economy / Mediator:
    <dc:title>Analyse...</dc:title>

1.3 General.Language (same as correspondence)
  New Economy / Mediator:
    <dc:language>
      <dcterms:RFC1766><rdf:value>DE</rdf:value></dcterms:RFC1766>
    </dc:language>

1.4 General.Description (same as correspondence)
  New Economy / Mediator:
    <dc:description>Dieses...</dc:description>

1.5 General.Keyword (same as correspondence)
  New Economy / Mediator:
    <dc:subject>
      <rdf:Bag>
        <rdf:li>analysis</rdf:li>
        <rdf:li>database systems</rdf:li>
      </rdf:Bag>
    </dc:subject>

1.8 General.Aggregation Level (value correspondence)
  New Economy:
    <ne:granularitytype rdf:resource=".../NE/rdfs#module"/>
    <ne:granularitytype rdf:resource=".../NE/rdfs#component"/>
    <ne:granularitytype rdf:resource=".../NE/rdfs#physicalelement"/>
  Mediator:
    <lom-gen:aggregationLevel rdf:resource=".../lom-general#AggregationLevel3"/>
    <lom-gen:aggregationLevel rdf:resource=".../lom-general#AggregationLevel2"/>
    <lom-gen:aggregationLevel rdf:resource=".../lom-general#AggregationLevel1"/>

2.3.2 LifeCycle.Contribute.Entity for 2.3.1 Contributor.Role = author / publisher (element correspondence)
  New Economy:
    <dc:creator>CIS TU - Berlin</dc:creator>
  Mediator:
    <dc:creator>
      <lom:Entity><vCard:FN>CIS TU - Berlin</vCard:FN></lom:Entity>
    </dc:creator>

4.1 Technical.Format (same as correspondence)
  New Economy / Mediator:
    <dc:format>
      <dcterms:IMT><rdf:value>image/gif</rdf:value></dcterms:IMT>
    </dc:format>

4.2 Technical.Size (same as correspondence)
  New Economy / Mediator:
    <dcterms:extent>
      <lom-tech:ByteSize><rdf:value>6286</rdf:value></lom-tech:ByteSize>
    </dcterms:extent>

5.2 Educational.Learning Resource Type (element correspondence)
  New Economy:
    <rdf:type>
      <rdf:Bag>
        <rdf:li>.../lom-educational#Introduction</rdf:li>
        <rdf:li>.../lom-educational#NarrativeText</rdf:li>
      </rdf:Bag>
    </rdf:type>
  Mediator:
    <rdf:type rdf:resource=".../lom-educational#Introduction"/>
    <rdf:type rdf:resource=".../lom-educational#NarrativeText"/>

5.5 Educational.Intended End User Role (element correspondence)
  New Economy:
    <lom-edu:intendedEndUserRole rdf:resource=".../lom-educational#Learner"/>
  Mediator:
    <lom-edu:intendedEndUserRole>
      <rdf:Bag>
        <rdf:li rdf:resource=".../lom-educational#Learner"/>
      </rdf:Bag>
    </lom-edu:intendedEndUserRole>

5.9 Educational.Typical Learning Time (same as correspondence)
  New Economy / Mediator:
    <lom-edu:typicalLearningTime>
      <lom:ISO8601><rdf:value>30</rdf:value></lom:ISO8601>
    </lom-edu:typicalLearningTime>

7.2 Relation.Resource.Identifier for 7.1 Relation.Kind = is part of (alternative correspondence with two element correspondences)
  New Economy:
    <dcterms:isPartOf rdf:resource="http://.../index.html"/>
    or
    <dcterms:isPartOf>
      <rdf:Bag>
        <rdf:li>http://.../index.html</rdf:li>
        <rdf:li>http://.../a31.html</rdf:li>
      </rdf:Bag>
    </dcterms:isPartOf>
  Mediator:
    <dcterms:isPartOf>
      <rdf:Bag>
        <rdf:li rdf:resource="http://.../index.html"/>
        <rdf:li rdf:resource="http://.../a31.html"/>
      </rdf:Bag>
    </dcterms:isPartOf>

7.2 Relation.Resource.Identifier for 7.1 Relation.Kind = has part (element correspondence)
  New Economy:
    <dcterms:hasPart rdf:resource="http://.../index.html"/>
    <dcterms:hasPart rdf:resource="http://.../a31.html"/>
  Mediator:
    <dcterms:hasPart>
      <rdf:Bag>
        <rdf:li rdf:resource="http://.../index.html"/>
        <rdf:li rdf:resource="http://.../a31.html"/>
      </rdf:Bag>
    </dcterms:hasPart>

4.1.4

Model Correspondence for CiteSeer

The CiteSeer wrapper uses a specific data structure for metadata about publications. The following table shows the mapping to the mediator schema.

1.2 General.Title (same as correspondence)
  CiteSeer / Mediator:
    <dc:title>Semistructured...</dc:title>

1.4 General.Description (same as correspondence)
  CiteSeer / Mediator:
    <dc:description>This...</dc:description>

2.3.2 LifeCycle.Contribute.Entity for 2.3.1 Contributor.Role = author (element correspondence)
  CiteSeer:
    <dc:creator>Buneman</dc:creator>
  Mediator:
    <dc:creator>
      <lom:Entity><vCard:FN>Buneman</vCard:FN></lom:Entity>
    </dc:creator>

2.3.3 LifeCycle.Contribute.Date for 2.3.1 Contributor.Role = author (element correspondence)
  CiteSeer:
    <dc:date>1997</dc:date>
  Mediator:
    <dcterms:created>
      <dcterms:W3CDTF><rdf:value>1997</rdf:value></dcterms:W3CDTF>
    </dcterms:created>

6.3 Annotation.Description (element correspondence)
  CiteSeer:
    <miweb-cs:citations>387</miweb-cs:citations>
  Mediator:
    <lom-annotation:annotation>
      <rdf:Bag><rdf:li>
        <lom-annotation:Annotation>
          <miweb-med:citations>387</miweb-med:citations>
        </lom-annotation:Annotation>
      </rdf:li></rdf:Bag>
    </lom-annotation:annotation>

4.2

Interface of the Mapper Component

The mapper, represented by the TransformationManager, supports three tasks (see Figure 4.2): the transformation of RDF data, the management of correspondences, and the administration of the mapper component itself, which particularly includes the management of the file-based store of correspondences.

[Figure 4.2 shows the TransformationManager class realizing three interfaces: RDFTransformer (transform), CorrespondenceRegistry (register, delete, showRegistry), and MapperAdmin (loadNewStore, replaceStore, moveStore).]

Figure 4.2: Interfaces of the Mapper Component

Interface RDFTransformer
The RDFTransformer interface contains only one operation to transform RDF data from one schema to another. The schemas are identified by their names (parameters source and target). If no correspondence between these schemas is registered at the mapper, a NotExistsException is thrown. A TransformationException indicates errors during the transformation process.
interface RDFTransformer {
    String transform(InputStream data, String source, String target)
        throws NotExistsException, TransformationException;
}
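A client call could look like the following sketch; the schema names and the file name are examples, and the direct instantiation of the TransformationManager is an assumption of this sketch.

import java.io.FileInputStream;
import java.io.InputStream;

public class TransformExample {
    public static void main(String[] args) throws Exception {
        RDFTransformer mapper = new TransformationManager();
        // transform CiteSeer RDF into the mediator representation
        InputStream data = new FileInputStream("citeseer-result.rdf");
        String mediatorRdf = mapper.transform(data, "CiteSeer", "Mediator");
        System.out.println(mediatorRdf);
    }
}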

Interface CorrespondenceRegistry
The mapper provides operations to register a correspondence, to delete one or more correspondences, and to write all registered correspondences to a specified output stream. Similar to the transformation interface, correspondences are identified by the names of the schemas they relate. To specify correspondences we use an XML string representation that will be described in the next section. We have also specified constraints on correspondences that are used for validating a given correspondence (ValidationException). Problems with the external storage of correspondences are reported by a LoadException or SaveException, respectively.
interface CorrespondenceRegistry {

    /** register a new correspondence */
    void register(String modelCorrespondence)
        throws ValidationException, LoadException, SaveException;

    /** delete one correspondence */
    void delete(String model1, String model2) throws SaveException;

    /** delete all correspondences related to the given schema */
    void delete(String model) throws SaveException;

    /** write all registered correspondences to the output stream */
    void showRegistry(java.io.Writer out);
}

Interface MapperAdmin
The administration interface only contains operations for changing the files that are used to store registered correspondences. In addition, the JSP of the mapper component provides an operation to delete the log files.
interface MapperAdmin {

    /** load another registry file and override the actual store */
    void loadNewStore(String newStoreName) throws SaveException, LoadException;

    /** change to another registry (and use its store file further on) */
    void replaceStore(String newStoreName) throws LoadException;

    /** move the actual store file to another location in the file system */
    void moveStore(String newStoreName) throws SaveException;
}


4.3

Design of the Mapper Component

The mapper component consists of four parts (see Figure 4.3):


[Figure 4.3 shows the package structure of the mapper: the TransformationManager facade uses the Transformation package (Transformer and Rule classes such as SimpleCorrespondenceRule, SameAsRule, ElementCorrespondenceRule, and ComplexCorrespondenceRule) and the Registry package (CorrespondenceRegistry, Validator); both use the Correspondence package.]
Figure 4.3: Design of the Mapper Component

The Correspondence package provides a Java representation of the metamodel of model correspondences.

The Registry package allows one to store and to search for model correspondences. In particular, the registry checks the consistency of model correspondences according to the constraints defined in the last section. Thus, this package is used to realize the CorrespondenceRegistry interface.

The Transformation package is used to realize the RDFTransformer interface, i.e. to transform RDF data according to a given model correspondence. As the transformation algorithm is specific for each kind of correspondence, we use the Strategy pattern ([GHJV95]): specific Rule classes realize the transformation operation for one specific correspondence type.

The TransformationManager is the facade class of the mapper component. It is responsible for delegating incoming calls to the CorrespondenceRegistry or the Transformer.


4.3.1

Managing Model Correspondences

We use the data binding framework Castor [URLa] for the persistent management of model correspondences. The framework generates a Java representation of model correspondences from the XML schema (see Appendix A). The generated Java classes also contain an XML serialization mechanism. The correspondence registry uses this XML binding for a file-based storage. The consistency of the model correspondence store is checked by the validator. Note that it does not only check for well-formedness, but also for the semantic constraints defined in the last section. A correspondence registry can be used by several clients simultaneously, even by concurrent threads. The registry synchronizes the access when the store is changed (using the synchronization mechanism of Java). As the store will be changed rarely, this will not result in a bottleneck of the application.
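The file-based storage can be sketched with Castor's static (un)marshalling calls; the generated class name ModelCorrespondenceStore is taken from the element name in Appendix A and is illustrative here.

import java.io.FileReader;
import java.io.FileWriter;
import org.exolab.castor.xml.Marshaller;
import org.exolab.castor.xml.Unmarshaller;

public class StoreIO {

    /** read the XML store file into the generated Java representation */
    public static ModelCorrespondenceStore load(String fileName) throws Exception {
        return (ModelCorrespondenceStore) Unmarshaller.unmarshal(
            ModelCorrespondenceStore.class, new FileReader(fileName));
    }

    /** serialize the Java representation back to the XML store file */
    public static void save(ModelCorrespondenceStore store, String fileName)
            throws Exception {
        Marshaller.marshal(store, new FileWriter(fileName));
    }
}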

4.3.2

Transforming RDF Data

The transformation process that is initiated by a transformation call consists of three main steps (see Figure 4.4):
[Figure 4.4 shows the interaction as a sequence diagram: the TransformationManager fetches the model correspondence from the CorrespondenceRegistry via get(source, target), determines the mapping direction, and calls transformLeftToRight or transformRightToLeft on the Transformer, which returns the result data.]
Figure 4.4: Transformation of RDF Data

1. The mapper looks for a model correspondence between the given models using the correspondence registry. If no such correspondence exists, an exception is thrown indicating that no transformation is possible.

2. The mapper compares the given model names with those in the model correspondence to determine in which direction the correspondence has to be read for the transformation.

3. The transformer performs the transformation and returns the transformed RDF data in a string representation to the mapper.


As already described, we use the Strategy pattern of [GHJV95] for the implementation of the transformation rules that belong to specific kinds of correspondences. All rules have to provide the methods applyLeftToRight and applyRightToLeft that realize the transformation based on an RDF model. If the methods are called with a wrong correspondence type, a NotSupportedException is thrown. With regard to the transformation, we follow the idea of transforming as much data as we can. The process is shown in Figure 4.5. The transformer takes all correspondences of the given model correspondence. For each, it tries to find a rule that can do the transformation. If no rule can be found, i.e. there exists no transformation algorithm for this correspondence type yet, this fact is written to a log file and the correspondence is skipped. Similarly, errors during the transformation are logged, but do not break off the transformation process. The result of the transformation as a whole is the union of all the RDF models that are built successfully using a single correspondence.
[Figure 4.5 depicts the loop of transformLeftToRight as an activity diagram: the input data is read into a source RDF model and an empty target model is created; for each correspondence the known Rule classes are tried in turn. A missing rule and transformation errors are logged; a successful rule application yields a partial model that is unioned into the target model, which is finally returned.]
Figure 4.5: Strategy Pattern for Transforming RDF Data

The implementation of the Transformer and the Rule classes is based on the Jena API [URLe], which provides methods to operate on RDF models. The rules that belong to simple correspondences mainly consist of two steps: firstly, querying the data elements from the source model that are addressed by the source part of the correspondence, and secondly, building model elements according to the target part of the correspondence. In contrast, the rules that belong to complex correspondences use a transformer again to transform the data according to the correspondences that are part of the complex correspondence.
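The following sketch illustrates the Strategy pattern for the rules. The method names follow the text; the body only indicates the idea of a same-as rule realized with the Jena API, simplifies property paths to single properties, and assumes the Correspondence, SameAs, and exception classes of the mapper component.

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.RDFNode;

interface Rule {
    Model applyLeftToRight(Model source, Correspondence corr)
        throws NotSupportedException, TransformationException;
    Model applyRightToLeft(Model source, Correspondence corr)
        throws NotSupportedException, TransformationException;
}

class SameAsRule implements Rule {
    public Model applyLeftToRight(Model source, Correspondence corr)
            throws NotSupportedException {
        if (!(corr instanceof SameAs)) throw new NotSupportedException();
        // a same-as element is copied unchanged: select all statements with
        // the property of the correspondence and add them to a fresh model
        Model target = ModelFactory.createDefaultModel();
        target.add(source.listStatements(null,
            source.getProperty(((SameAs) corr).getLeftPath()), (RDFNode) null));
        return target;
    }
    public Model applyRightToLeft(Model source, Correspondence corr)
            throws NotSupportedException {
        return applyLeftToRight(source, corr);  // symmetric for same-as
    }
}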


Chapter 5

Mediator-based Integration
The main task of the mediator component is answering RDQL queries on the MIWeb schema. Due to the heterogeneity and distribution involved, this differs considerably from query processing in central database systems: the given query has to be translated into sets of queries against the wrappers integrated into the MIWeb system. This problem of query rewriting is well known in federated information systems ([Hal01, LMSS95, PAGM96]). Two approaches can be distinguished: view expansion in systems that define the mediator schema as views on the underlying wrapper schemas (called global-as-view), and answering queries using views in systems that allow the wrappers to define their schemas as views on the mediator schema (called local-as-view). No matter which approach is chosen, two problems have to be considered: the heterogeneity of mediator and wrapper schemas and the restricted query capabilities of wrappers. In the MIWeb system, the heterogeneity between mediator and wrapper schema is solved by the wrapper components (using the mapper), so that the mediator only has to consider the restricted query interfaces. Thus, we describe our query processing algorithm independently from the existing approaches to reduce complexity, although we are aware of the similarities.

Briefly, the query processing of the mediator consists of three different activities. Initially the query is planned with regard to the information sources that are registered in the MIWeb system. The resulting plan is executed by sending the identified partial plans to the sources. Finally, the resulting RDF models are integrated and presented to the user.

Moreover, the mediator is responsible for managing two kinds of metadata. The query capabilities of integrated sources are registered by the source administrators and saved as metadata. In the planning process, control elements are required for identifying correct plans in the presence of hierarchically arranged learning objects.

The following sections describe the design of the mediator component with its class structure, the query capability manager and the detailed process of query processing.

5.1

Design of the Mediator Component

An overview of the class and package structure of the mediator component is given in Figure 5.1. The Mediator class acts as the facade class for the component and controls the execution of the other classes. A query is transferred from the front-end JSP page to the Mediator class, where it is handed over to the RDQLQueryParser class for parsing. The result of the RDQL parsing process is used as input for the PrologPlanner class. The planning process returns a set of feasible plans to the Mediator class. These plans are executed by the PlanExecutor class. Because the integration of RDF models from various sources is realised during the plan execution process, we do not need an extra class for this purpose. All classes concerned with query processing are gathered in the package query_processing.
[Figure 5.1 shows the mediator package: the Mediator facade uses the manager package (QCManager, CEManager) and the query_processing package with its subpackages parsing (RDQLQueryParser), planning (PrologPlanner), and execution (PlanExecutor); all of them use the qc_datastructure package (Enquery, QueryCapability).]
Figure 5.1: Design of the Mediator Component

Besides its responsibility for controlling the query processing task, the Mediator class forwards all requests concerning the management of metadata to the classes in the package manager. Within this package there are two classes named QCManager and CEManager. Whereas the QCManager handles query capabilities, the CEManager is in charge of control elements. All classes depicted in Figure 5.1 use the qc_datastructure package, which provides the data structure of the component based on our representation of query capabilities.

5.2

Managing Query Capabilities

The query interfaces of the information sources that are registered in the MIWeb system are described by query capabilities. It is not possible to get an answer to every possible query on the mediator schema from each source because the sources offer only limited query capabilities. This fact has to be taken into account by the mediator during query processing. All query capabilities in the MIWeb system comprise the URL of the wrapper interface and two lists with the input parameters and output attributes of the information source.

Example: The Google wrapper needs a title as input parameter and returns an RDF model that contains the URIs (in RDF terms rdf:about) and descriptions of all documents that are found. The following query capability is registered in the mediator system:

URL:        http://localhost:8080/google/googleWrapper.jsp
Parameters: http://purl.org/dc/elements/1.1/title
Results:    rdf:about, http://purl.org/dc/elements/1.1/description

Capabilities of the information sources are managed with the QCManager class. The QCManager offers four public methods as shown in Figure 5.2. Note that only the methods registerQC(), deleteQC() and deletePropertiesFile() are accessible from outside via the JSP interface, which forwards the instructions through the Mediator class to the QCManager. (The relevant part of the mediator is also shown in Figure 5.2.)
[Figure 5.2 shows the public methods: the QCManager offers initQCList(), registerQC(), deleteQC(), and deletePropertiesFile(); the Mediator class offers register(), delete(), and deleteAll() for QC management and execute() for query processing.]

Figure 5.2: Public Methods of the QCManager and Mediator Classes

The methods of the QCManager have the following functionality:

initQCList() makes saved query capability metadata available to the mediator.

registerQC(String url, String parameters, String results) announces a new query capability of an information source to the system by specifying the URL of the source and its parameter and result attributes. This method corresponds to the register(String url, String parameters, String results) method of the Mediator class.

deleteQC(String url) clears all query capabilities of the source with the given URL. Its counterpart is the delete(String url) method of the Mediator class.

deletePropertiesFile() erases the complete query capability database. This method corresponds to the method deleteAll() of the Mediator class.
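Registering the Google capability from the example above could then look as follows; the comma-separated encoding of the result list and the direct instantiation of the Mediator class are assumptions of this sketch.

public class RegisterExample {
    public static void main(String[] args) {
        Mediator mediator = new Mediator();
        mediator.register(
            "http://localhost:8080/google/googleWrapper.jsp",
            "http://purl.org/dc/elements/1.1/title",
            "rdf:about,http://purl.org/dc/elements/1.1/description");
    }
}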

5.3
5.3.1

Query Processing
Data Structure for Query Processing
[Figure 5.3 shows the qc_datastructure package: the classes Enquery and QueryCapability (collected in a QueryCapabilityList) with their ParameterList, ResultList, and CtrlElemList, which contain Parameter, Result, and CtrlElem elements based on a common Attribute class.]
Figure 5.3: Class Structure of the qc_datastructure Package

The most important concepts of the data structure with respect to query processing are query capabilities, queries and plans. A query capability describes the ability of a wrapper interface to answer specific queries. Therefore, the QueryCapability class contains the URL of the wrapper and lists of the parameters and result attributes that the source supports (see Figure 5.3). An example was given in the last section.

User queries are constructed in a similar way. The Enquery class also comprises lists of parameters and results which the mediator has to consider during plan generation and execution. Besides this, queries contain explicit values for their parameters. These are necessary as input for the plan execution process. In contrast to the QueryCapability class, the URL is missing, because a user query is handled locally in the mediator and is not related to a particular source.

A plan is represented by a list of QueryCapability instances. In the process of plan execution these interface descriptions are transformed into RDQL queries that are sent to the specific wrapper interfaces identified by the stored URL. The detailed process of plan generation and execution is discussed in the next subsections.

5.3.2

Query Planning

Query planning is realised by querying a Prolog database which is dynamically created from the registered query capabilities. Each query capability is converted into as many Prolog rules as there are results in the query capability. Each result forms the head of one rule, whereas the parameters of the corresponding query capability are contained in the tail.

Example: The query capability that describes the Google wrapper in the example of Section 5.2 is converted to the following Prolog rules (simplified representation):
rdf:about :- title.
description :- title.
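The rule generation itself is straightforward; a hypothetical sketch (not the actual planner code) of compiling a query capability into Prolog rules:

import java.util.List;

public class PrologRuleGenerator {

    /** Emits, for each result attribute, a rule "result :- p1, ..., pn." */
    public static String toRules(List<String> parameters, List<String> results) {
        StringBuilder tail = new StringBuilder();
        for (int i = 0; i < parameters.size(); i++) {
            if (i > 0) tail.append(", ");
            tail.append(parameters.get(i));
        }
        StringBuilder rules = new StringBuilder();
        for (String result : results) {
            rules.append(result).append(" :- ").append(tail).append(".\n");
        }
        return rules.toString();
    }
}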

Figure 5.4: Detailed View of the Prolog Planning Algorithm


After a user query has been parsed by the RDQLQueryParser class, the PrologPlanner class is responsible for posing the query to the Prolog database. This is accomplished by invoking the SWI Prolog interpreter1 with appropriate parameters on the database. The resulting plan is transferred from Prolog to a Java representation using the JPL Java-Prolog interface that is part of the SWI Prolog distribution.

The planning algorithm first builds up partial plans for each result attribute in the user query using the query capabilities of the wrappers. The idea is the following: we are looking for paths of query capabilities that start at the single result attribute and terminate at some of the parameters given in the query (see Figure 5.4). If a path contains more than one query capability, the plan will connect data from different sources to determine the result attribute. Note that this is a difficult point: we cannot combine sources without defined relationships between them.

Take an example: we want to know the titles of learning objects from a given institution. The first source takes an institution, but only provides the year of publication. But there exists a second source that provides titles of documents taking the year of publication. So, can we combine these two sources to answer our query? Obviously, we cannot! We would get titles of documents published in the same year as the learning objects from our institution, but they could be titles of documents of someone else. This example shows that we have to be careful when combining query capabilities. We use the fusion attributes introduced in Section 2.1.2 for this purpose. They are a kind of global key, so that all attributes determined by a following source within a plan belong to the same object. The year of publication used in the example above isn't such a fusion attribute, so these sources will not be connected this way.

Another problem in query processing are queries that affect attributes of different, related e-Learning objects. We introduce hierarchical elements to be able to attach attributes to the correct objects. Thus, these elements support the planning process over hierarchies of referenced documents that are arbitrarily networked. The query capability rules are applicable on every hierarchical level of documents. Both hierarchical elements and global fusion attributes are types of control elements.

Within the partial plans it can be examined whether all parameters of the user query are involved in the planning process. Furthermore, the entirety of partial plans can be tested as to whether all requested results appear at least once. These tests are optional: the user may be satisfied with a partial answer to his query. Finally, the partial plans are combined and joined into one cumulative plan (see Figure 5.5).
1 http://www.swi-prolog.org/


[Figure 5.5 summarizes the planning pipeline: generate the plan list for the query's parameter and result lists, check the global fusion attributes, perform the parameter test and the result test, and build the final plan by joining the elementary plans.]

Figure 5.5: Overview of the Prolog Planning Algorithm

5.3.3

Plan Execution and Result Integration

The query capabilities contained in a plan are executed sequentially. The Mediator class is able to transform query capabilities into RDQL queries and send them to the registered sources. The planning process ensures that the queries are provided with suitable parameters. These may originate from the user query or from the result of a previous source query. The results of the source queries are RDF models that are integrated into an aggregated RDF model. For the integration, the fusion attribute values of the most recent RDF model and the cumulative RDF model have to be compared with each other. Because of the hierarchical network of the documents, the referenced level has to be considered during this integration step.

The RDF term rdf:about, which stands for the URI of a document, has to be treated in a special manner during the whole plan execution and integration process. This is due to its special meaning in RDF models: while the other parameters and results are accessed via a path of properties, a path with the term rdf:about contains a resource description as its last clause. This has to be considered during model processing. The last action of plan execution is to present the aggregated RDF model to the user.
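The integration step can be sketched with the Jena API as follows; the fusion property URI is a placeholder, and the matching is the plain string comparison discussed in Section 6.2.

import com.hp.hpl.jena.rdf.model.*;

public class ResultIntegrator {

    // hypothetical fusion attribute; the real system uses the fusion
    // attributes of the mediator schema (Section 2.1.2)
    static final String FUSION = "http://purl.org/dc/elements/1.1/title";

    /** Merges a partial result into the cumulative model: statements about
     *  a resource whose fusion value matches an existing resource are
     *  re-attached to that resource; everything else is added as-is. */
    public static void integrate(Model cumulative, Model partial) {
        Property fusion = cumulative.createProperty(FUSION);
        StmtIterator it = partial.listStatements(null, fusion, (RDFNode) null);
        while (it.hasNext()) {
            Statement stmt = it.nextStatement();
            // compare fusion values of the two models (string comparison)
            StmtIterator hits = cumulative.listStatements(
                null, fusion, stmt.getObject());
            Resource target = hits.hasNext()
                ? hits.nextStatement().getSubject()
                : stmt.getSubject();
            // copy all properties of the matched resource to the target
            StmtIterator props = partial.listStatements(
                stmt.getSubject(), null, (RDFNode) null);
            while (props.hasNext()) {
                Statement p = props.nextStatement();
                cumulative.add(target, p.getPredicate(), p.getObject());
            }
        }
    }
}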


Chapter 6

Conclusion
6.1 Experiences

In our experience a metadata-based integration of data sources using a wrapper-mediator architecture fits the requirements of searching the web. Firstly, metadata standards are a good means to bridge heterogeneous sources. Secondly, the approach of query capabilities and mapping rules allows one to integrate new data sources and change wrapper services dynamically. This way, the autonomy of web sources is taken into account. Furthermore, the implementation can be reused to build mediators for different domains.

But there are also disadvantages of such a domain-independent solution of a mediator component. The query planning determines all possible plans of querying the integrated sources to get some data that is relevant for the query. Domain-specific optimizations are not possible. This can result in long answer times if big overlapping web sources like search engines are integrated into the system. Therefore we propose mediators that answer domain-specific queries, using more general sources only to enrich the query results of specific ones. For example, our prototype should be used for queries about learning material, not for a simple keyword search.

In detail, we made experiences with metadata standards and the technologies usually used within the web context. The most important points are discussed in the following.

RDF as common data model. The RDF format and the related RDF Schema are well suited for semistructured web data and its describing metadata. When designing a mediator-based information system, one should consider that the chosen data model has effects on the domain-specific query capabilities and mapping rules as well as on the query execution and result integration of the mediator component. The integration rules have to be defined for the different modelling concepts of the data model. The collection types and structuring concepts of RDF require complex integration rules to handle all possible integration

conflicts. In addition, the distinction of resources (the URLs of our learning objects) and properties in RDF has to be considered. As RDF is a relatively new modeling language, we miss some common rules for modelling as they exist for well-known languages like entity-relationship modeling. This lack results in many possible structural conflicts between autonomous RDF models that have to be solved within the mappings.

Query languages and RDQL. There are many query languages for RDF. We used RDQL because it is very readable, being quite similar to the notation of SQL. In distributed information systems the semantics of the query language and the query capabilities should be precisely defined, especially the logic between selection and projection attributes. Using RDQL, a tuple is only part of the result if it contains all of the requested attributes. In federated information systems this is not appropriate: the user usually wants any information that he can get, even if some of the required attributes are missing. In RDQL, as in other existing query languages, the user cannot specify whether given attributes must occur or not. In our opinion, this would be a useful extension in the context of querying the web.

Metadata standards as mediator schemas. Our mediator is based on the LOM standard. This standard includes attributes to describe learning objects as well as value spaces for these attributes. When integrating data from heterogeneous sources, both parts have to be considered. So the mediator schema should always contain definitions about the data representation to enable the system to compare data objects and to identify data conflicts. Usually a given metadata standard does not fulfill all requirements for describing data. Therefore, most projects extend the metadata set. Our mediator only supports the core LOM metadata set to reduce the integration costs of defining mapping rules. However, we also detailed the value space of some of the LOM attributes. For these differences we also need (very simple) mapping rules to integrate even LOM-compatible data sources.

The most effort is needed to adapt a metadata standard to the context of a mediator. In our example we want to support queries on learning objects and related authors, publications etc. As the LOM standard concentrates on learning objects, it has to be combined with other standards or schemas, e.g. to describe publications and their quality. Here it would be nice to have a standard method or extension points within metadata standards for the combination with others. Then the building of mediator schemas as an application-specific ontology based on more general ones would be possible.

From a technical point of view we see that the definition of the mediator schema contains a semantical part and a structural part: LOM defines the semantic concepts of the mediator, the concrete mapping to an RDF schema defines the structural representation of these concepts. Existing LOM-compatible sources use different representations so that they require a mapping although they use the same terminology. In our system the


mapping rules relate semantic concepts and map their representation as well, but it would be nice to separate structural and semantical aspects.

6.2

Discussion of the MIWeb Architecture

The MIWeb system shows how mediator architectures can be used to build high-quality engines for searching the web. It improves on search engines by providing semantically richer query interfaces, automatically combining sources, and eliminating redundant results. The system is based on wrappers encapsulating web sources, query capabilities describing the wrapper services, and a domain-independent query planning algorithm that determines plans of wrapper queries that give some answers to a given query. This domain-independent solution allows flexible changes of the system configuration as well as the reuse of components in other contexts. For this purpose the MIWeb system takes care of the main challenges in heterogeneous information systems, particularly within the World Wide Web:

the encapsulation of unstructured web sources only providing a simple query interface (wrapper components);

the integration of heterogeneous data by mapping them to a standardized metadata schema (mapper component);

the domain-independent query processing that enables the system to connect data from different sources and to eliminate redundant results (mediator component).

Wrapping of Web Sources. Wrappers in the MIWeb system have to encapsulate a web source according to the requirements of the mediator: they have to provide an interface that accepts RDQL queries and returns RDF data according to the LOM standard. Web sources usually do not provide an adequate interface, so the wrappers have to resolve these heterogeneities. The main problems in this context are the data extraction from HTML sources and the handling of restrictive web source interfaces. The former problem is well known in the literature ([LRNST02]); we used a grammar-based approach. For the wrapping of restrictive interfaces we introduced two main concepts:

The query tunneling approach allows us to wrap restrictive web interfaces in two steps. Firstly, we use query relaxation to map a given query to the source interface. Secondly, we execute the original query against the result of the relaxed query to restrict the result set to the correct elements. Query tunneling also improves the interface of restrictive web sources because all attributes that are part of the source data can now be used in queries even if they could not be used in source queries before.


Query serialization enables us to deal with operational restrictions of the web source, for example if the query language differs in its richness, if the source returns its data in several parts, or if the source data is distributed over several web pages. The Google/Roodolf wrapper of the MIWeb system shows how query serialization can be used to map RDQL queries to (possibly more than one) Datalog queries.

We described a general wrapper architecture with several layers. Each of them tackles a specific problem that can occur when wrapping a web source. It allows one to easily identify which components are necessary for a specific source. Starting from our prototype implementation of the MIWeb system we will make further investigations on specific wrapping problems, especially

on query relaxation for different kinds of queries (for example, we have not treated range queries exhaustively yet);

on query serialization and result integration, particularly on wrapping sources that use more than one page to represent their query results;

on automatically generating grammars for data extraction similar to [CMM01] or [NSS02];

on semi-automatically building and maintaining wrappers.

Mapping of RDF Data. The integration of heterogeneous web sources (and their wrappers) is based on a pre-defined mediator schema defined upon the existing metadata standard LOM. Obviously, sources can use different data representations, so a transformation is needed. We propose the concepts of model correspondences and mappings for this purpose. A model correspondence defines a same as relationship between elements of different schemas. It can be translated into a mapping that enables us to perform the data transformation. The explicit specification of model correspondences as system metadata is one prerequisite for the dynamic integration of new data sources.

The MIWeb mapper component follows this flexible approach. It is based on a metamodel for model correspondences. The transformation of RDF data is realized by specific algorithms operating on the RDF models. In a similar way a query transformation could be realized in the future. Although a classical mediator-wrapper architecture doesn't define an explicit mapper component, we do so because the mapper component can be used on the mediator level as well as on the wrapper level. Many architectures (like [PAGM96, HZ96, MP03]) allow wrapper administrators to specify their wrapper interfaces in their specific schema terminology, so that the mediator is responsible for the required mapping. Other architectures (as our MIWeb system) only allow them to integrate wrappers according to the mediator schema. Then the wrapper component has to perform a data transformation.

The metamodel of model correspondences as well as the mapper implementation were developed according to the requirements of our prototype

system. Although we've already made experiments in using them in other contexts, they should be revised in the future. In particular, we will address two aspects:

tool support for the specification of model correspondences. Thereby, the interoperation with other techniques in correspondence discovery and management (like [SRB03, DMD+03, RB01]) has to be taken into account;

the revision of the mapper implementation by alternative techniques for data transformation. We will consider an implementation based on logical rules as well as the translation of model correspondences into existing mapping languages like XSLT for XML.

Mediation. The mediation contains three important aspects: firstly, the mediator defines the structure of query capabilities that allows administrators to integrate their wrappers into the system. Secondly, it realizes the query processing by rewriting a given query to plans of queries that can be sent to the underlying wrappers, possibly combining them. Thirdly, the mediator executes queries on the sources and integrates the results.

On the one side, the MIWeb mediator demonstrates that a flexible integration of web sources is possible. The mediator allows us both to develop many different scenarios in the e-Learning domain upon it and to build similar mediators for other domains. On the other side, each of the aspects named above concerns challenges in federated information systems that are not solved yet, so that many further investigations have to be done. Regarding the mediator of the MIWeb system this includes

the description of wrapper interfaces: the query capability structure of MIWeb is very simple. For example, it only describes the query schema without further information about the source content and isn't able to specify optional parameters. Here, we will analyse whether service descriptions like WSDL ([W3C01]) can be used for our purposes;

the optimization of query planning: we already mentioned that our goal is to find all possible plans, which is not appropriate for large-scale systems. We will use optimization techniques from other query rewriting approaches ([LMSS95, PAGM96]) to improve the algorithm;

the combination of data sources / query capabilities: the combination of sources is based on fusion attributes. Obviously, it is quite difficult to specify such global keys in heterogeneous systems. In addition, the MIWeb system uses a string comparison to match the values of fusion attributes. Due to data conflicts this approach often fails. In future, we want to integrate comparison operators that allow us to define domain-specific similarity relationships.

Nevertheless, our MIWeb prototype shows how concepts of data integration and federated information systems can be used to build a domain-specific mediator system. It provides information on learning material and related publications

Nevertheless, our MIWeb prototype shows how concepts of data integration and federated information systems can be used to build a domain-specific mediator system. It provides information on learning material and related publications using domain-specific learning material sources, the research index CiteSeer, and the web search engine Google, and it uses web technology and RDF for integration.

Besides the specific aspects discussed above, our further work will address

- the development of mediators in other domains reusing MIWeb components,
- the examination of other common data models like XML,
- extensions of the query capabilities and query languages to consider more semantics useful in heterogeneous information systems,
- the integration and usage of other technologies like web services and P2P infrastructures, and
- the development of tools for the generation and evolution of wrappers and mapping rules.


Appendix A

Model Correspondences
This appendix defines the XML schema of the metamodel for model correspondences that was introduced in Chapter 4. As an example of its use, we also list the model correspondences for the NewEconomy and Citeseer sources according to this schema. The structure follows their description in Chapter 4.

A.1 XML Schema for Model Correspondences

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:annotation>
    <xsd:documentation xml:lang="en">
      Schema for MIWeb Correspondences
      Copyright 2004 Susanne Busse TU-Berlin. All rights reserved.
    </xsd:documentation>
  </xsd:annotation>

  <!-- ------------ Correspondence Types ------------------- -->

  <xsd:complexType name="CorrespondenceType">
    <!-- optional name for correspondences for readable messages. -->
    <xsd:attribute name="Name" type="xsd:string"/>
  </xsd:complexType>

  <xsd:complexType name="SimpleCorrespondenceType">
    <xsd:complexContent>
      <xsd:extension base="CorrespondenceType">
        <xsd:sequence>
          <xsd:element name="LeftPath"  type="xsd:string" minOccurs="1" maxOccurs="1"/>
          <xsd:element name="RightPath" type="xsd:string" minOccurs="1" maxOccurs="1"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>

  <xsd:complexType name="ComplexCorrespondenceType">
    <xsd:complexContent>
      <xsd:extension base="CorrespondenceType">
        <xsd:sequence>
          <xsd:element name="CorrespondenceStep" type="Correspondence"
                       minOccurs="2" maxOccurs="unbounded"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>

  <xsd:complexType name="ValueCorrespondenceType">
    <xsd:complexContent>
      <xsd:extension base="SimpleCorrespondenceType">
        <xsd:sequence>
          <xsd:element name="LeftValue"  type="xsd:string" minOccurs="1" maxOccurs="1"/>
          <xsd:element name="RightValue" type="xsd:string" minOccurs="1" maxOccurs="1"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>

  <xsd:complexType name="Correspondence">
    <xsd:choice>
      <!-- sameAs - elements are the same in the schemas.
           The element is only written once. -->
      <xsd:element name="SameAs" type="SimpleCorrespondenceType"/>
      <!-- elementCorrespondence - elements correspond but differ
           in their tag/path name -->
      <xsd:element name="ElementCorrespondence" type="SimpleCorrespondenceType"/>
      <!-- valueCorrespondence - element values correspond but differ
           in their tag/path name or in their value. -->
      <xsd:element name="ValueCorrespondence" type="ValueCorrespondenceType"/>
      <!-- sequentialCorrespondence - more than one correspondence type
           must be applied for the mapping. -->
      <xsd:element name="SequentialCorrespondence" type="ComplexCorrespondenceType"/>
      <!-- alternativeCorrespondence - more than one correspondence type
           is defined for a model element. -->
      <xsd:element name="AlternativeCorrespondence" type="ComplexCorrespondenceType"/>
    </xsd:choice>
  </xsd:complexType>

  <!-- ---------- Model Correspondence Store --------------- -->

  <xsd:element name="ModelCorrespondenceStore" minOccurs="1" maxOccurs="1">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="ModelCorrespondence" minOccurs="0" maxOccurs="unbounded">
          <xsd:complexType>
            <xsd:sequence>
              <!-- LeftSchema and RightSchema identify the corresponding schemas.
                   The names are used to reference the schema in mapping calls. -->
              <xsd:element name="LeftSchema"  type="xsd:string" minOccurs="1" maxOccurs="1"/>
              <xsd:element name="RightSchema" type="xsd:string" minOccurs="1" maxOccurs="1"/>
              <!-- a model correspondence contains one or more correspondences. -->
              <xsd:element name="ModelCorrespondenceCorr" type="Correspondence"
                           minOccurs="1" maxOccurs="unbounded"/>
            </xsd:sequence>
            <!-- URL identifies the tag name used for resources in the RDF model. -->
            <!-- if URL is not set, the standard encoding RDF+"about" is used. -->
            <xsd:attribute name="URL" type="xsd:string"/>
            <!-- ListItem identifies the tag name used for references
                 of collection items. -->
            <xsd:attribute name="ListItem" type="xsd:string"/>
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

</xsd:schema>
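Note that the following sections list the individual ModelCorrespondence elements only. According to the root element defined above, a complete correspondence document wraps them in a ModelCorrespondenceStore; a minimal skeleton (with the correspondences elided) would look as follows:

<ModelCorrespondenceStore>
  <ModelCorrespondence URL="url" ListItem="li">
    <LeftSchema>NewEconomy</LeftSchema>
    <RightSchema>Mediator</RightSchema>
    <!-- ModelCorrespondenceCorr elements as listed in Section A.2 -->
  </ModelCorrespondence>
  <ModelCorrespondence ListItem="li">
    <LeftSchema>CiteSeer</LeftSchema>
    <RightSchema>Mediator</RightSchema>
    <!-- ModelCorrespondenceCorr elements as listed in Section A.3 -->
  </ModelCorrespondence>
</ModelCorrespondenceStore>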

A.2 NewEconomy Mediator

<ModelCorrespondence URL="url" ListItem="li">
  <LeftSchema>NewEconomy</LeftSchema>
  <RightSchema>Mediator</RightSchema>

  <!-- 1.2. General.Title: same as correspondence -->
  <ModelCorrespondenceCorr>
    <SameAs>
      <LeftPath>http://purl.org/dc/elements/1.1/title</LeftPath>
      <RightPath>http://purl.org/dc/elements/1.1/title</RightPath>
    </SameAs>
  </ModelCorrespondenceCorr>

  <!-- 1.3. General.Language: same as correspondence -->
  <ModelCorrespondenceCorr>
    <SameAs>
      <LeftPath>http://purl.org/dc/elements/1.1/language
                http://purl.org/dc/terms/RFC1766
                http://www.w3.org/1999/02/22-rdf-syntax-ns#value</LeftPath>
      <RightPath>http://purl.org/dc/elements/1.1/language
                 http://purl.org/dc/terms/RFC1766
                 http://www.w3.org/1999/02/22-rdf-syntax-ns#value</RightPath>
    </SameAs>
  </ModelCorrespondenceCorr>

  <!-- 1.4. General.Description: same as correspondence -->
  <ModelCorrespondenceCorr>
    <SameAs>
      <LeftPath>http://purl.org/dc/elements/1.1/description</LeftPath>
      <RightPath>http://purl.org/dc/elements/1.1/description</RightPath>
    </SameAs>
  </ModelCorrespondenceCorr>

  <!-- 1.5. General.Keyword: same as correspondence -->
  <ModelCorrespondenceCorr>
    <SameAs>
      <LeftPath>http://purl.org/dc/elements/1.1/subject
                http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag
                li</LeftPath>
      <RightPath>http://purl.org/dc/elements/1.1/subject
                 http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag
                 li</RightPath>
    </SameAs>
  </ModelCorrespondenceCorr>

  <!-- 1.8. General.Aggregation Level: value correspondences -->
  <ModelCorrespondenceCorr>
    <ValueCorrespondence>
      <LeftPath>http://nutria.cs.tu-berlin.de/NE/rdfs#granularitytype
                url</LeftPath>
      <RightPath>http://ltsc.ieee.org/2002/09/lom-general#aggregationLevel
                 url</RightPath>
      <LeftValue>http://nutria.cs.tu-berlin.de/NE/rdfs#module</LeftValue>
      <RightValue>AggregationLevel3</RightValue>
    </ValueCorrespondence>
  </ModelCorrespondenceCorr>
  <ModelCorrespondenceCorr>
    <ValueCorrespondence>
      <LeftPath>http://nutria.cs.tu-berlin.de/NE/rdfs#granularitytype
                url</LeftPath>
      <RightPath>http://ltsc.ieee.org/2002/09/lom-general#aggregationLevel
                 url</RightPath>
      <LeftValue>http://nutria.cs.tu-berlin.de/NE/rdfs#component</LeftValue>
      <RightValue>AggregationLevel2</RightValue>
    </ValueCorrespondence>
  </ModelCorrespondenceCorr>
  <ModelCorrespondenceCorr>
    <ValueCorrespondence>
      <LeftPath>http://nutria.cs.tu-berlin.de/NE/rdfs#granularitytype
                url</LeftPath>
      <RightPath>http://ltsc.ieee.org/2002/09/lom-general#aggregationLevel
                 url</RightPath>
      <LeftValue>http://nutria.cs.tu-berlin.de/NE/rdfs#physicalelement</LeftValue>
      <RightValue>AggregationLevel1</RightValue>
    </ValueCorrespondence>
  </ModelCorrespondenceCorr>

  <!-- 2.3.2 LifeCycle.Contribute.Entity for author: element correspondence -->
  <ModelCorrespondenceCorr>
    <ElementCorrespondence>
      <LeftPath>http://purl.org/dc/elements/1.1/creator</LeftPath>
      <RightPath>http://purl.org/dc/elements/1.1/creator
                 http://ltsc.ieee.org/2002/09/lom-base#Entity
                 http://www.w3.org/2001/vcard-rdf/3.0#FN</RightPath>
    </ElementCorrespondence>
  </ModelCorrespondenceCorr>

  <!-- 2.3.2 LifeCycle.Contribute.Entity for publisher: element correspondence -->
  <ModelCorrespondenceCorr>
    <ElementCorrespondence>
      <LeftPath>http://purl.org/dc/elements/1.1/publisher</LeftPath>
      <RightPath>http://purl.org/dc/elements/1.1/publisher
                 http://ltsc.ieee.org/2002/09/lom-base#Entity
                 http://www.w3.org/2001/vcard-rdf/3.0#FN</RightPath>
    </ElementCorrespondence>
  </ModelCorrespondenceCorr>

  <!-- 4.1. Technical.Format: same as correspondence -->
  <ModelCorrespondenceCorr>
    <SameAs>
      <LeftPath>http://purl.org/dc/elements/1.1/format
                http://purl.org/dc/terms/IMT
                http://www.w3.org/1999/02/22-rdf-syntax-ns#value</LeftPath>
      <RightPath>http://purl.org/dc/elements/1.1/format
                 http://purl.org/dc/terms/IMT
                 http://www.w3.org/1999/02/22-rdf-syntax-ns#value</RightPath>
    </SameAs>
  </ModelCorrespondenceCorr>

  <!-- 4.2. Technical.Size: same as correspondence -->
  <ModelCorrespondenceCorr>
    <SameAs>
      <LeftPath>http://purl.org/dc/terms/extent
                http://ltsc.ieee.org/2002/09/lom-technical#ByteSize
                http://www.w3.org/1999/02/22-rdf-syntax-ns#value</LeftPath>
      <RightPath>http://purl.org/dc/terms/extent
                 http://ltsc.ieee.org/2002/09/lom-technical#ByteSize
                 http://www.w3.org/1999/02/22-rdf-syntax-ns#value</RightPath>
    </SameAs>
  </ModelCorrespondenceCorr>

  <!-- 5.2. Educational.Learning Resource Type: element correspondence -->
  <ModelCorrespondenceCorr>
    <ElementCorrespondence>
      <LeftPath>http://www.w3.org/1999/02/22-rdf-syntax-ns#type
                http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag
                li</LeftPath>
      <RightPath>http://www.w3.org/1999/02/22-rdf-syntax-ns#type
                 url</RightPath>
    </ElementCorrespondence>
  </ModelCorrespondenceCorr>

  <!-- 5.5. Educational.Intended End User Role: element correspondence -->
  <ModelCorrespondenceCorr>
    <ElementCorrespondence>
      <LeftPath>http://ltsc.ieee.org/2002/09/lom-educational#intendedEndUserRole
                url</LeftPath>
      <RightPath>http://ltsc.ieee.org/2002/09/lom-educational#intendedEndUserRole
                 http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag
                 li
                 url</RightPath>
    </ElementCorrespondence>
  </ModelCorrespondenceCorr>

  <!-- 5.9. Educational.Typical Learning Time: same as correspondence -->
  <ModelCorrespondenceCorr>
    <SameAs>
      <LeftPath>http://ltsc.ieee.org/2002/09/lom-educational#typicalLearningTime
                http://ltsc.ieee.org/2002/09/lom-base#ISO8601
                http://www.w3.org/1999/02/22-rdf-syntax-ns#value</LeftPath>
      <RightPath>http://ltsc.ieee.org/2002/09/lom-educational#typicalLearningTime
                 http://ltsc.ieee.org/2002/09/lom-base#ISO8601
                 http://www.w3.org/1999/02/22-rdf-syntax-ns#value</RightPath>
    </SameAs>
  </ModelCorrespondenceCorr>

  <!-- 7.2 Relation.Resource Identifier for Kind "is part of":
       alternative correspondence with two element correspondences -->
  <ModelCorrespondenceCorr>
    <AlternativeCorrespondence>
      <CorrespondenceStep>
        <ElementCorrespondence>
          <LeftPath>http://purl.org/dc/terms/isPartOf
                    http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag
                    li</LeftPath>
          <RightPath>http://purl.org/dc/terms/isPartOf
                     http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag
                     li
                     url</RightPath>
        </ElementCorrespondence>
      </CorrespondenceStep>
      <CorrespondenceStep>
        <ElementCorrespondence>
          <LeftPath>http://purl.org/dc/terms/isPartOf
                    url</LeftPath>
          <RightPath>http://purl.org/dc/terms/isPartOf
                     http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag
                     li
                     url</RightPath>
        </ElementCorrespondence>
      </CorrespondenceStep>
    </AlternativeCorrespondence>
  </ModelCorrespondenceCorr>

  <!-- 7.2 Relation.Resource Identifier for Kind "has part": element correspondence -->
  <ModelCorrespondenceCorr>
    <ElementCorrespondence>
      <LeftPath>http://purl.org/dc/terms/hasPart
                url</LeftPath>
      <RightPath>http://purl.org/dc/terms/hasPart
                 http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag
                 li
                 url</RightPath>
    </ElementCorrespondence>
  </ModelCorrespondenceCorr>
</ModelCorrespondence>

A.3 Citeseer Mediator

<ModelCorrespondence ListItem="li">
  <LeftSchema>CiteSeer</LeftSchema>
  <RightSchema>Mediator</RightSchema>

  <!-- 1.2. General.Title: same as correspondence -->
  <ModelCorrespondenceCorr>
    <SameAs>
      <LeftPath>http://purl.org/dc/elements/1.1/title</LeftPath>
      <RightPath>http://purl.org/dc/elements/1.1/title</RightPath>
    </SameAs>
  </ModelCorrespondenceCorr>

  <!-- 1.4. General.Description: same as correspondence -->
  <ModelCorrespondenceCorr>
    <SameAs>
      <LeftPath>http://purl.org/dc/elements/1.1/description</LeftPath>
      <RightPath>http://purl.org/dc/elements/1.1/description</RightPath>
    </SameAs>
  </ModelCorrespondenceCorr>

  <!-- 2.3.2 LifeCycle.Contribute.Entity for author: element correspondence -->
  <ModelCorrespondenceCorr>
    <ElementCorrespondence>
      <LeftPath>http://purl.org/dc/elements/1.1/creator</LeftPath>
      <RightPath>http://purl.org/dc/elements/1.1/creator
                 http://ltsc.ieee.org/2002/09/lom-base#Entity
                 http://www.w3.org/2001/vcard-rdf/3.0#FN</RightPath>
    </ElementCorrespondence>
  </ModelCorrespondenceCorr>

  <!-- 2.3.2 LifeCycle.Contribute.Date for author: element correspondence -->
  <ModelCorrespondenceCorr>
    <ElementCorrespondence>
      <LeftPath>http://purl.org/dc/elements/1.1/date</LeftPath>
      <RightPath>http://purl.org/dc/terms/created
                 http://purl.org/dc/terms/W3CDTF
                 http://www.w3.org/1999/02/22-rdf-syntax-ns#value</RightPath>
    </ElementCorrespondence>
  </ModelCorrespondenceCorr>

  <!-- 6.3 Annotation.Description: element correspondence -->
  <ModelCorrespondenceCorr>
    <ElementCorrespondence>
      <LeftPath>http://grizzly.cs.tu-berlin.de:8000/citeseer/elements/citations</LeftPath>
      <RightPath>http://ltsc.ieee.org/2002/09/lom-annotation#annotation
                 http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag
                 li
                 http://ltsc.ieee.org/2002/09/lom-annotation#Annotation
                 http://grizzly.cs.tu-berlin.de/mediator/schema#citations</RightPath>
    </ElementCorrespondence>
  </ModelCorrespondenceCorr>
</ModelCorrespondence>

Bibliography
[BK04] S. Busse and R.-D. Kutsche. Model-based Information Integration as a Pre-requisite for Intra- and Inter-Enterprise Information Flows. In Proceedings of the 2nd Ljungby Workshop on Information Logistics The Knowledge Gap in Enterprise Information Flow, Centrum for Informationslogistik, Ljungby, Sweden, Sept. 16-17 2004. S. Busse, R.-D. Kutsche, U. Leser, and H. Weber. Federated Information Systems: Concepts, Terminology and Architectures. Forschungsberichte des Fachbereichs Informatik Nr. 99-9, Technische Universitt a Berlin, 1999. T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientic American, Feature Articles, May 2001. S. Busse and C. Pons. Schema Evolution in Federated Information Systems. In A. Heuer, F. Leymann, and D. Priebe, editors, Datenbanksysteme in Bro, Technik und Wissenschaft, Informatik aktuell, pages 26 u 43. Springer, 2001. Susanne Busse. Modellkorrespondenzen fr die kontinuierliche Entwicku lung mediatorbasierter Informationssysteme. PhD thesis, TU Berlin, Fakultt IV Elektrotechnik und Informatik, 2002. Logos Verlag Berlin. a R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999. W. W. Chu and Q. Chen. A structured approach for cooperative query answering. IEEE Transactions on Knowledge and Data Engineering, 6(5):738749, 1994. Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, Daniele Nardi, and Riccardo Rosati. A principled approach to data integration and reconciliation in data warehousing. In Proceedings of the Intl. Workshop on Design and Management of Data Warehouses, DMDW99, volume 19 of CEUR Workshop Proceedings, pages 161 1611, 1999. Chia-Hui Chang. IEPAD: Information extraction based on pattern discovery. In Tenth International World Wide Web Conference, pages 681 687, 2001. W. Cohen, M. Hurst, and L. Jensen. A exible learning system for wrapping tables and lists in html documents. In The Eleventh International World Wide Web Conference WWW-2002, 2002. Valter Crescenzi, Giansalvatore Mecca, and Paolo Merialdo. Roadrunner: Towards automatic data extraction from large web sites. In VLDB 2001, Proceedings of 27th International Conference on Very Large Data Bases, September 11-14, 2001, Roma, Italy, pages 109118, 2001.

[BKLW99]

[BLHL01] [BP01]

[Bus02]

[BYRN99] [CC94]

[CGL+ 99]

[Cha01]

[CHJ02]

[CMM01]

64

[Com02]

IEEE Learning Technology Standards Committee. Standard for Information Technology Education and Training Systems Learning Objects and Metadata. Technical report, IEEE, 2002. S. Conrad. Fderierte Datenbanksysteme: Konzepte der Datenintegrao tion. Springer, 1997. Dublin Core Metadata Initiative. Dublin Core Metadata Element Set, version 1.1: Reference Description. DCMI Recomandation 1999-07-02, 1999. AnHai Doan, Jayant Madhavan, Robin Dhamankar, Pedro Domingos, and Alon Halevy. Learning to match ontologies on the semantic web. The VLDB Journal, 12(4):303319, 2003. Federal Geographic Data Committee. Content Standards for Digital Geospatial Metadata. Version 2.0, FGDC-STD-001-1998, Jun. 1998. Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns, Elements of Reusable Object-Oriented Software. AddisonWesley, 1995. Christoph Gldner, Thomas Kabisch, and Jrn Guy S. Developing roo o u bust wrapper-systems with content based recognition. In Philippe Thiran and Willem-Jan van den Heuvel, editors, Proc. First International Workshop on Wrapper Techniques for Legacy Systems WRAP 2004, in connection with the 11th IEEE Working Conference on Reverse Engineering WCRE, pages 4454, Technische Universiteit Eindhoven, 2004.

[Con97] [DCMI99]

[DMD+ 03]

[FGDC98] [GHJV95]

[GKS04]

[GLdSRN00] Paulo Braz Golgher, Alberto H. F. Laender, Altigran Soares da Silva, and Berthier A. Ribeiro-Neto. An example-based environment for wrapper generation. In ER 00: Proceedings of the Workshops on Conceptual Modeling Approaches for E-Business and The World Wide Web and Conceptual Modeling, pages 152164. Springer-Verlag, 2000. [Hal01] [HZ96] Alon Y. Halevy. Answering queries using views: A survey. The VLDB Journal, 10(4):270294, 2001. Richard Hull and Gang Zhou. A framework for supporting data integration using the materialized and virtual approaches. In SIGMOD 96: Proceedings of the 1996 ACM SIGMOD international conference on Management of data, pages 481492. ACM Press, 1996. ISO International Standard Organisation. Industrial automation systems and integration Product data representation and exchange part 11: Description methods: The EXPRESS language reference manual. ISO Standard 10303 part(11), 2nd ed., ISO TC184/SC4, Nov. 2004. Florian Jung, Ralf-D. Kutsche, and Dirk Rother. Robust Wrapping in Mediator-based Information Systems. In Anne E. James and Muhammad Younas, editors, Technical Report of the 20th British National Conference on Databases, BNCOD 20, Coventry, UK, Poster Papers, pages 2023, School of Mathematical and Information Sciences, Coventry University, July 2003. T. Kabisch. Grammatikbasiertes semantisches Wrapping fr fderierte u o Informationssysteme. In Tagungsband zum 15. GI-Workshop Grundlagen von Datenbanken, pages 6266, 2003. Dongwon Lee. Query Relaxation for XML Model. PhD thesis, University of California, Los Angeles, 2002. Ulf Leser. Query Planning in Mediator Based Information Systems. PhD thesis, TU Berlin, Fachbereich Informatik, 2000.

[ISO04]

[JKR03]

[Kab03]

[Lee02] [Les00]

65

[LGH02]

Alexander Lser, Christian Grune, and Marcus Homann. A didactic o model, denition of learning objects and selection of metadata for an online curriculum. In Michael E. Auer and Ursula Auer, editors, 5th. International Workshop Interactive Computer aided Learning (ICL). Carinthia Tech Institute, Kassel University Press, 2002. Alon Y. Levy, Alberto O. Mendelzon, Yehoshua Sagiv, and Divesh Srivastava. Answering queries using views. In Proceedings of the 14th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 95104, San Jose, Calif., 1995. A. Laender, B. Ribeiro-Neto, A. Silva, and J. Teixeira. A brief survey of web data extraction tools. SIGMOD Record, 31(2), June 2002. Alon Y. Levy, Anand Rajaraman, and Joann J. Ordille. Querying heterogeneous information sources using source descriptions. In VLDB 96: Proceedings of the 22th International Conference on Very Large Data Bases, pages 251262. Morgan Kaufmann Publishers Inc., 1996. R.J. Miller. Using schematically heterogeneous structures. ACM SIGMOD Record, 27(2):189200, 1998. A. Magkanaraki, G. Karvounareakis, T.T. Anh, V. Christophides, and D. Plexousakis. Ontology Storage and Querying. Technical Report 308, Foundation for Research and Technology Hellas, Institute of Computer Science, Information Systems Laboratory, April 2002. A. Motro. Flex: A tolerant and cooperative user interface to databases. IEEE Transactions on Knowledge and Data Engineering, 2(2):231246, 1990. Peter McBrien and Alexandra Poulovassilis. Data integration by bidirectional schema transformation rules. In Proceedings of the 19th International Conference on Data Engineering (ICDE), Bangalore, India, pages 227238. IEEE Computer Society, 2003. M. Nilsson, M. Palmer, and J. Brase. The LOM RDF binding principles and implementation. In 3rd Annual Ariadne Conference, Katholieke Universiteit Leuven, Belgium, November 2003. A draft of the RDF binding can be found at http://kmr.nada.kth.se/el/ims/metadata.html. Mattis Neiling, Markus Schaal, and Martin Schumann. Wrapit : Automated integration of web databases with extensional overlaps. In Web, Web-Services, and Database Systems, NODe 2002 Web and DatabaseRelated Workshops, Erfurt, Germany, volume 2593 of Lecture Notes in Computer Science (LNCS), pages 184198. Springer, 2002. W. Nejdl, B. Wolf, C. Qu, M. Sintek, A. Naeve, M. Nilsson, and M. Palmer. EDUTELLA: A P2P Networking Infrastructure Based on RDF. In Proc. 11th Int. World Wide Web Conference, WWW2002, Honolulu, Hawaii. ACM, May 2002. Yannis Papakonstantinou, Serge Abiteboul, and Hector Garcia-Molina. Object fusion in mediator systems. In VLDB 96: Proceedings of the 22th International Conference on Very Large Data Bases, pages 413 424. Morgan Kaufmann Publishers Inc., 1996. L. Popa, Y. Velegrakis, R.J. Miller, M.A. Hernandez, and R. Fagin. Translating web data. In VLDB 02: Proceedings of the 28th International Conference on Very Large Data Bases, pages 598609. Morgan Kaufmann Publishers Inc., 2002.

[LMSS95]

[LRNST02] [LRO96]

[Mil98] [MKA+ 02]

[Mot90]

[MP03]

[NPB03]

[NSS02]

[NWQ+ 02]

[PAGM96]

[PVM+ 02]

66

[RB01] [SBKK03]

Erhard Rahm and Philip A. Bernstein. A survey of approaches to automatic schema matching. The VLDB Journal, 10(4):334350, 2001. W. Sigel, S. Busse, R.-D. Kutsche, and M. Klettke. Designing a Metadata-based Infrastructure for E-Learning. In A. James, S. Conrad, and W. Hasselbring, editors, Proc. of the 5th Workshop Engineering Federated Information Systems (EFIS), pages 89 99. University of Coventry, UK, July 2003. F. Saltor, M. Castellanos, and M. Garcia-Solaco. Suitability of data models as canonical models for federated databases. ACM SIGMOD Record, 20(4):4448, 1991. Ingo Schmitt. Schemaintegration fr den Entwurf fderierter Datenu o banken., volume 43 of DISDBIS. Inx Verlag, St. Augustin, Germany, 1998. A.P. Sheth and J.A. Larson. Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases. ACM Computing Surveys, 22(3):183236, 1990. A.P. Sheth and R. Meersman. Amicalola Report: Database and Information Systems Research Challenges and Opportunities in Semantic Web and Enterprises. SIGMOD Record, 31(4), 2002. S. Spacciapietra and C. Parent. Conicts and Correspondence Assertions in Interoperable Databases. SIGMOD Record, 20(4):4954, 1991. S.Melnik, E. Rahm, and P.A. Bernstein. Rondo: A programming platform for generic model management. In Proc. ACM SIGMOD 2003, June 2003. J.D. Ullman. Information Integration using Logical Views. In Proc. 6th Int. Conf. on Database Theory, Delphi, Greece, 1997. exolab.org, The Castor Project. http://castor.exolab.org. last visited 2005/04. Nec Research Institute, CiteSeer Scientic Literature Digital Library. http://www.citeseer.org. last visited 2005/04. Dublin Core Metadata Initiative, Homepage. http://dublincore.org/. last visited 2005/04. Google. http://www.google.de/. last visited 2005/04. HP labs, Jena Java RDF API and toolkit. http://www.hpl.hp.com/ semweb/. last visited 2005/04. Project New Economy - Homepage. http://neweconomy.e-learning. fu-berlin.de/. Project founded of the bmb+f within the program Neue Medien in der Bildung, last visited 2005/04. RDQL - RDF Data Query Language. http://www.hpl.hp.com/semweb/ rdql.htm. last visited 2005/04. RooDolF 2.0. http://nutria.cs.tu-berlin.de:8080/roodolf2/ index.html. last visited 2005/04. SemanticWeb.org, Homepage. http://www.semantic-web.org/. last visited 2005/04. W3C, Semantic Web Activity. http://www.w3.org/2001/sw/. last visited 2005/04. W3C World Wide Web Consortium. XSL Transformations (XSLT), Version 1.0. W3C Recommendation 16-Nov-99, Nov. 1999.

[SCGS91]

[Sch98]

[SL90]

[SM02]

[SP91] [SRB03]

[Ull97] [URLa] [URLb] [URLc] [URLd] [URLe] [URLf]

[URLg] [URLh] [URLi] [URLj] [W3C99]

67

[W3C01] [W3C04a] [W3C04b] [W3C04c]

W3C World Wide Web Consortium. Web Services Description Language (WSDL) 1.1. W3C Note NOTE-wsdl-20010315, March 2001. W3C World Wide Web Consortium. RDF Primer. W3C Recommendation, REC-rdfprimer-20040210, Feb. 2004. W3C World Wide Web Consortium. RDF Semantics. W3C Recommendation, REC-rdf-mt-20040210, Feb. 2004. W3C World Wide Web Consortium. RDF Vocabulary Description Language 1.0: RDF Schema. W3C Recommendation, REC-rdf-schema20040210, Feb. 2004. W3C World Wide Web Consortium. Resource Description Framework (RDF): Concepts and Abstract Syntax. W3C Recommendation, RECrdf-concepts-20040210, Feb. 2004. Gio Wiederhold. Mediators in the Architecture of Future Information Systems. In Michael N. Huhns and Munindar P. Singh, editors, Readings in Agents, pages 185 196. Morgan Kaufmann, San Francisco, CA, USA, 1997. Jiying Wang and Fred H. Lochovsky. Data extraction and label assignment for web databases. In Twelft International World Wide Web Conference, pages 470480, 2003. H. Wache, T. Vgele, U. Visser, H. Stuckenschmidt, G. Schuster, H. Neuo mann, and S. Hbner. Ontology-Based Integration of Information A u Survey of Existing Approaches. In IJCAI-01 Workshop Ontologies and Information Sharing, pages 108117, 2001.

[W3C04d]

[Wie97]

[WL03]

[WVV+ 01]

68
