eXtreme eXtensibility Page 1 of 8

eXtreme eXtensibility
(XML Schema Design for the Semantic Web)
by Roger L. Costello
January 25, 2003
Issue: how do we design an XML Schema that places no restrictions on the vocabulary that instance
documents employ, and that facilitates the growth of data in a highly distributed fashion?

Uses of the eXtreme eXtensibility Design Pattern


This document describes a way of designing XML Schemas which enables data to be collected (and stored as
XML) in an independent, distributed fashion, and with no restrictions on the XML vocabulary. The design
pattern that is presented is especially useful for situations where:

• you would like to collect data about something (and store the data as XML), and
• you would like to allow for the data to be collected by various people (that is, the data is collected in a
distributed fashion), and
• you would like to allow for unforeseen data (that is, no limits on the XML vocabulary employed), and
• you would like to be able to aggregate all the distributed data to generate a consolidated picture.

Here are some examples of situations that would benefit from this design pattern:

• Describing a Geographic Resource: for example, there may be several independent teams of scientists
collecting data about an active volcano. It would be beneficial to enable each team to publish their data
independently, and then at a later time aggregate all the data. Another example in this category is the
collection of data about a river. I will use this as my example throughout this paper.
• Providing Biographical Data about a Person: for example, suppose that you are in charge of a small
team of people tasked to collect biographical data about Albert Einstein. This schema design pattern is
ideally suited, as it allows each member of the team to work and publish independently.
• Intelligence Collecting: by nature, collecting intelligence data is a distributed activity. This schema design
pattern supports that data collection activity.

These are just a few uses of the schema design pattern that will be presented in this paper. In general, whenever
you want your data collection and publication activities to take advantage of the Web environment then this is a
good design pattern. Clearly, its use is limited only by your imagination.

Classical XML Schema Design


Classical XML Schema design is prescriptive. That is, the schema identifies a resource type and prescribes what
data is allowed for that resource type. For example, consider this data about China's Yangtze river:
<River id="Yangtze">
    <Length>6300 kilometers</Length>
    <StartingLocation>western China's Qinghai-Tibet Plateau</StartingLocation>
    <EndingLocation>East China Sea</EndingLocation>
</River>

Classical schema design would declare a River element, comprised of a sequence of child elements - Length,
StartingLocation, and EndingLocation. For example, here's an XML Schema that follows the classical design
approach:
<element name="River">
    <complexType>
        <sequence>
            <element name="Length" type="string"/>
            <element name="StartingLocation" type="string"/>
            <element name="EndingLocation" type="string"/>
        </sequence>
        <attribute name="id" type="ID" use="required"/>
    </complexType>
</element>

Deficiencies of the Classical Schema Design in a Web Environment


XML instance documents and, by implication, XML Schemas are intended to be used within a Web
environment. Yet classical XML Schema design is at odds with the Web environment, as we shall see below.

The Web Approach to Data - eXtreme eXtensibility!


Suppose that you create a Web page containing data about the Yangtze river. Independent of you, and without
your knowledge, I can create a Web page that contains other data about the Yangtze river. And I can link my
Web page to your Web page. Your Web page adds value to my Web page and vice versa. A third person can
create still another Web page containing more data about the Yangtze river, and link it to both of our Web
pages. A rich Web of data about the Yangtze river is emerging.

Note the characteristics of this Web approach to providing data about the Yangtze river:

1. Distributed Data: the data about the Yangtze river is distributed over multiple Web pages.
2. Unlimited Vocabulary: the richness and amount of data that is available to describe the Yangtze river is
limited only by the imagination of the Web page designers.
3. Unordered: there is no order to the data about the Yangtze river. The hyperlinked Web pages comprise
an unordered collection of data.
4. Aggregation: the sum total data about the Yangtze river is obtained by aggregating the disparate Web
page data.

The Web is the classic example of what Dee Hock refers to as a "chaord" (a system that has a mix of chaos and
order).

The Classical XML Schema Approach to Data


A typical XML Schema for defining the structure and type of allowable data about the Yangtze river is shown
here:
<element name="River">
    <complexType>
        <sequence>
            <element name="Length" type="string"/>
            <element name="StartingLocation" type="string"/>
            <element name="EndingLocation" type="string"/>
        </sequence>
        <attribute name="id" type="ID" use="required"/>
    </complexType>
</element>

The XML Schema specifies what data is allowed ("you can give the Length of the river, its StartingLocation,
and its EndingLocation"), and in what order. If you create an XML Web page, and the page conforms to the
above schema then it must necessarily be limited to the type of data dictated by the schema, e.g.:
<River id="Yangtze">
    <Length>6300 kilometers</Length>
    <StartingLocation>western China's Qinghai-Tibet Plateau</StartingLocation>
    <EndingLocation>East China Sea</EndingLocation>
</River>

If I create an XML Web page then my page will look the same. And if a third person creates an XML Web page
it too will look the same. The schema imposes a uniform, regulated, controlled data design.
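
To make the prescriptive nature of the classical design concrete, here is a minimal sketch in Python (standard library only). It is not a schema validator: it checks only the one rule discussed above, namely that River's children are exactly Length, StartingLocation, and EndingLocation, in that order. The function name follows_classical_sequence and the reordered sample document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The fixed child order prescribed by the classical River schema.
CLASSICAL_ORDER = ["Length", "StartingLocation", "EndingLocation"]


def follows_classical_sequence(xml_text):
    """Return True if River's children are exactly the prescribed elements, in order."""
    river = ET.fromstring(xml_text)
    return [child.tag for child in river] == CLASSICAL_ORDER


# The instance document from the text.
conforming = """
<River id="Yangtze">
    <Length>6300 kilometers</Length>
    <StartingLocation>western China's Qinghai-Tibet Plateau</StartingLocation>
    <EndingLocation>East China Sea</EndingLocation>
</River>
"""

# Hypothetical variant: the same data in a different order.
reordered = """
<River id="Yangtze">
    <EndingLocation>East China Sea</EndingLocation>
    <Length>6300 kilometers</Length>
    <StartingLocation>western China's Qinghai-Tibet Plateau</StartingLocation>
</River>
"""

print(follows_classical_sequence(conforming))  # True
print(follows_classical_sequence(reordered))   # False
```

Under the classical schema, even a harmless reordering of the same three facts is rejected, and any element outside the prescribed three is rejected as well.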

Let's note the characteristics of the classic XML Schema approach to defining data about the Yangtze river:

1. Centralized Data: even though there may be multiple Web sites containing data about the Yangtze river
they all contain the same data (vocabulary). Thus, there is effectively one centralized body of data.
2. Limited Vocabulary: the richness and amount of data that is provided for describing the Yangtze river is
strictly limited by the XML Schema. Unforeseen ways of describing the Yangtze river are not supported.
3. Ordered: there is strict ordering of the data about the Yangtze river.
4. Collection: the sum total data about the Yangtze river is obtained by going to any one of the Web sites and
collecting the data.

Recap: Web Data Design vs. Classic XML Schema Data Design
Below is a table which summarizes the above discussion.

XML Schemas                  Web

controlled growth of data    anarchical growth of data
centralized data             distributed data
collect data                 aggregate data
vertical data design         lateral data design

Effective Use of XML and XML Schemas in a Web World


To make XML, and by implication XML Schemas, usable in a Web world requires a shift in how we design
XML Schemas and in how we think about data.

Below are the requirements on an XML Schema for describing the Yangtze river where we make effective use
of the Web environment:

• Unlimited Vocabulary: There can be no restriction on the contents of the River element. This will allow
for unforeseen data. Any vocabulary that the schema provides should be considered as merely a starting
point.
• Unordered: There can be no restriction on the order of the data.

[Note: complete anarchy would place absolutely no restrictions on the content of the River element. For
example, it would allow the River element to contain a "fuselage" element. We don't want complete anarchy.
We want to allow any elements, as long as they make sense for a River.]

Next we will see how to design a schema that takes into account these requirements.

XML Schema Design for eXtreme eXtensibility!


Here's how to design a schema that meets the above requirements:
<element name="River">
    <complexType>
        <sequence>
            <any namespace="##targetNamespace" maxOccurs="unbounded"/>
        </sequence>
        <attribute name="id" type="ID" use="required"/>
    </complexType>
</element>
<element name="Length" type="string"/>
<element name="StartingLocation" type="string"/>
<element name="EndingLocation" type="string"/>

Let's point out the key features of this schema:

• There is no restriction on the contents of River, other than to say that the contents must belong to this
namespace (in other words, I only want to allow as content elements which make sense for the River
element).
• An initial vocabulary - Length, StartingLocation, EndingLocation - is provided for use in the contents of
the River element. This vocabulary may be extended, as we shall see.

Let's go back to the example of the three people building Web pages: you create an XML Web page that uses
just the vocabulary provided in the schema:

<River id="Yangtze"
       xmlns="http://www.geodesy.org/river">
    <Length>6300 kilometers</Length>
    <StartingLocation>western China's Qinghai-Tibet Plateau</StartingLocation>
    <EndingLocation>East China Sea</EndingLocation>
</River>

It is important to note that there is no restriction on the ordering of the elements within River. Further, there is
no restriction on the number of occurrences of an element (we will see an example of this below). Thus, this is a
general technique for enabling the contents of an element to occur in any order, and with any number of
occurrences! Note that this technique has similarities to the <all> element, but is much more powerful!
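
As a rough illustration of what the wildcard admits, the Python sketch below (standard library only) checks just the namespace rule: every child of River must belong to the River namespace, in any order and any number of times. It is not a schema validator. In particular, with the default strict processing of a wildcard, a real validator would also require a declaration for each child. The function name children_outside_namespace, the shuffled sample, and the http://example.org/aircraft namespace are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

RIVER_NS = "http://www.geodesy.org/river"


def children_outside_namespace(xml_text, namespace=RIVER_NS):
    """Return the tags of River's children that are NOT in `namespace`."""
    river = ET.fromstring(xml_text)
    prefix = "{" + namespace + "}"
    return [child.tag for child in river if not child.tag.startswith(prefix)]


# Hypothetical document: children out of the classical order, one repeated.
shuffled = """
<River id="Yangtze" xmlns="http://www.geodesy.org/river">
    <EndingLocation>East China Sea</EndingLocation>
    <Length>6300 kilometers</Length>
    <Length>6300 kilometers</Length>
</River>
"""

# Hypothetical document: a child from another namespace, which the wildcard excludes.
foreign = """
<River id="Yangtze" xmlns="http://www.geodesy.org/river">
    <fuselage xmlns="http://example.org/aircraft"/>
</River>
"""

print(children_outside_namespace(shuffled))  # []
print(children_outside_namespace(foreign))   # ['{http://example.org/aircraft}fuselage']
```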

Next, I would also like to create an XML Web page about the Yangtze river but I would like to provide different
data - data about the famous Three Gorges Dam. Here's my XML Web page:
<River id="Yangtze"
       xmlns="http://www.geodesy.org/river"
       xmlns:d="http://www.geodesy.org/dam">
    <Dam>
        <d:Name>The Three Gorges Dam</d:Name>
        <d:Width>1.5 miles</d:Width>
        <d:Height>610 feet</d:Height>
        <d:Cost>$30 billion</d:Cost>
    </Dam>
</River>

The Dam element is not one of the initial vocabulary that the XML Schema declares, so I will need to
supplement the initial vocabulary by creating my own schema that declares the Dam element:
<include schemaLocation="River.xsd"/>
<import namespace="http://www.geodesy.org/dam"
        schemaLocation="Dam.xsd"/>

<element name="Dam">
    <complexType>
        <sequence>
            <any namespace="http://www.geodesy.org/dam" maxOccurs="unbounded"/>
        </sequence>
    </complexType>
</element>

The declaration for the Dam element says that its content can be anything, as long as the elements come from
the http://www.geodesy.org/dam namespace. The schema Dam.xsd defines that namespace. Here's Dam.xsd:
<element name="Name" type="string"/>
<element name="Width" type="string"/>
<element name="Height" type="string"/>
<element name="Cost" type="string"/>

Finally, the third person wants to provide data about the different names that the Yangtze river has acquired
over time:
<River id="Yangtze"
       xmlns="http://www.geodesy.org/river">
    <Name>Dri Chu - Female Yak River</Name>
    <Name>Tongtian He, Travelling-Through-the-Heavens River</Name>
    <Name>Jinsha Jiang, River of Golden Sand</Name>
</River>

Since this third person is also using vocabulary that is not present in the initial schema vocabulary list, he/she
must supplement the initial vocabulary by creating a schema:
<include schemaLocation="River.xsd"/>

<element name="Name" type="string"/>

Note that since the River element placed no restrictions on its contents, we are able to use the Name element
repeatedly.

Recap of the Example


The above example showed how an XML Schema can be designed to enable each XML Web page to be
designed independently and with no restriction on what or how much data is provided for the Yangtze river. It
demonstrates a growing body of information about the Yangtze river. Aggregator tools can scoop up all the
disparate pieces of (XML-structured) data to present a consolidated view of the Yangtze river. Nice!

Characteristics of this Design Pattern


This paper has presented a way of designing schemas that enables data to be collected (and represented using
XML) in an independent, distributed fashion, and with no restriction on the XML vocabulary. It is important to
highlight the key characteristics of schemas that employ this design pattern:

1. Schemas are Small: using this design pattern you create one schema for each resource (e.g., create a
schema for a river), and provide an initial set of vocabulary that may be used with that resource (e.g.,
Length, StartingLocation, and EndingLocation). If you have a different resource (e.g., volcano) then you
would create a different schema (and provide an initial set of vocabulary that may be used with the volcano
resource).
2. Namespace = Class: with this design pattern the schema's targetNamespace is essentially defining a class
(e.g., the targetNamespace for the River example was http://www.geodesy.org/river, which can be
interpreted as "this schema is defining the river class").

Aggregation
Earlier I referred to the ability of a tool to aggregate disparate pieces of data and to present a consolidated view.
Let's now look at how to aggregate the disparate Yangtze river data.

In the example we had three people who created data about the Yangtze river. We ended up with 3 schemas and
3 instance documents:

The 3 Schemas
River.xsd: declared the River element, and declared the initial set of vocabulary that may be used to populate
the River element - Length, StartingLocation, and EndingLocation.

River2.xsd: this schema supplemented the initial vocabulary with a Dam element.

River3.xsd: this schema also supplemented the initial vocabulary with a Name element.

The 3 Instance Documents


Yangtze.xml: this instance document showed the Length, StartingLocation, and EndingLocation of the River.

Yangtze2.xml: this instance document described the River's Three Gorges Dam.

Yangtze3.xml: this instance document listed the different names that the River has acquired.

Putting the data together


An aggregator tool will assemble the various data about the River, to create a single document:
<River id="Yangtze"
       xmlns="http://www.geodesy.org/river"
       xmlns:d="http://www.geodesy.org/dam">
    <Length>6300 kilometers</Length>
    <StartingLocation>western China's Qinghai-Tibet Plateau</StartingLocation>
    <EndingLocation>East China Sea</EndingLocation>
    <Dam>
        <d:Name>The Three Gorges Dam</d:Name>
        <d:Width>1.5 miles</d:Width>
        <d:Height>610 feet</d:Height>
        <d:Cost>$30 billion</d:Cost>
    </Dam>
    <Name>Dri Chu - Female Yak River</Name>
    <Name>Tongtian He, Travelling-Through-the-Heavens River</Name>
    <Name>Jinsha Jiang, River of Golden Sand</Name>
</River>

Thus, this instance document is a composite of all the disparate data about the Yangtze river!
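
The paper does not say how an aggregator tool is built. As one possible sketch (Python, standard library only), the function below merges River documents that share the same id by appending their children in the order the documents are supplied. The function name aggregate and the rule "same id means same river" are assumptions made for illustration. The three documents are the ones from the example.

```python
import xml.etree.ElementTree as ET


def aggregate(documents):
    """Merge River documents describing the same river into one River element."""
    merged = None
    for text in documents:
        river = ET.fromstring(text)
        if merged is None:
            # Start the composite with the first document's tag and attributes.
            merged = ET.Element(river.tag, dict(river.attrib))
        elif river.get("id") != merged.get("id"):
            raise ValueError("documents describe different rivers")
        # Append this document's children after those already collected.
        merged.extend(list(river))
    return merged


yangtze = """
<River id="Yangtze" xmlns="http://www.geodesy.org/river">
    <Length>6300 kilometers</Length>
    <StartingLocation>western China's Qinghai-Tibet Plateau</StartingLocation>
    <EndingLocation>East China Sea</EndingLocation>
</River>
"""

yangtze2 = """
<River id="Yangtze" xmlns="http://www.geodesy.org/river"
       xmlns:d="http://www.geodesy.org/dam">
    <Dam>
        <d:Name>The Three Gorges Dam</d:Name>
        <d:Width>1.5 miles</d:Width>
        <d:Height>610 feet</d:Height>
        <d:Cost>$30 billion</d:Cost>
    </Dam>
</River>
"""

yangtze3 = """
<River id="Yangtze" xmlns="http://www.geodesy.org/river">
    <Name>Dri Chu - Female Yak River</Name>
    <Name>Tongtian He, Travelling-Through-the-Heavens River</Name>
    <Name>Jinsha Jiang, River of Golden Sand</Name>
</River>
"""

composite = aggregate([yangtze, yangtze2, yangtze3])
local_names = [child.tag.split("}")[1] for child in composite]
print(local_names)
# ['Length', 'StartingLocation', 'EndingLocation', 'Dam', 'Name', 'Name', 'Name']
```

Because River places no restriction on the order or number of its children, simple concatenation is enough here. Calling ET.tostring(composite, encoding="unicode") would serialize the result; ElementTree generates its own namespace prefixes unless they are registered with ET.register_namespace.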

To validate the composite instance document the aggregator tool must compose all of the schema declarations
into a single file. Here's the resulting schema:
<import namespace="http://www.geodesy.org/dam"
        schemaLocation="Dam.xsd"/>

<element name="River">
    <complexType>
        <sequence>
            <any namespace="##targetNamespace" maxOccurs="unbounded"/>
        </sequence>
        <attribute name="id" type="ID" use="required"/>
    </complexType>
</element>
<element name="Length" type="string"/>
<element name="StartingLocation" type="string"/>
<element name="EndingLocation" type="string"/>
<element name="Dam">
    <complexType>
        <sequence>
            <any namespace="http://www.geodesy.org/dam" maxOccurs="unbounded"/>
        </sequence>
    </complexType>
</element>
<element name="Name" type="string"/>

Thus, the vocabulary that is now available for describing a River is - Length, StartingLocation, EndingLocation,
Dam, and Name!
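
The composition step can be sketched as follows (Python, standard library only). The schema fragments in this paper omit the enclosing schema element, so the helper wrap below supplies one with the XML Schema namespace as the default namespace and the River targetNamespace. That wrapper, and the names wrap and compose, are assumptions made for illustration. The sketch drops include elements (the included declarations are composed in directly), keeps one import per namespace, and places imports ahead of the declarations.

```python
import xml.etree.ElementTree as ET

XSD = "http://www.w3.org/2001/XMLSchema"
RIVER_NS = "http://www.geodesy.org/river"


def wrap(body):
    """Assumed wrapper: enclose a fragment from the text in a <schema> root."""
    return '<schema xmlns="%s" targetNamespace="%s">%s</schema>' % (XSD, RIVER_NS, body)


def compose(schema_texts, target_namespace=RIVER_NS):
    """Combine the declarations of several schemas into one <schema> element."""
    imports, declarations, imported = [], [], set()
    for text in schema_texts:
        for child in ET.fromstring(text):
            if child.tag == "{%s}include" % XSD:
                continue  # the included schema is composed in directly
            if child.tag == "{%s}import" % XSD:
                if child.get("namespace") not in imported:
                    imported.add(child.get("namespace"))
                    imports.append(child)
            else:
                declarations.append(child)
    composed = ET.Element("{%s}schema" % XSD, {"targetNamespace": target_namespace})
    composed.extend(imports)       # imports come before the declarations
    composed.extend(declarations)
    return composed


river_xsd = wrap("""
<element name="River">
    <complexType>
        <sequence>
            <any namespace="##targetNamespace" maxOccurs="unbounded"/>
        </sequence>
        <attribute name="id" type="ID" use="required"/>
    </complexType>
</element>
<element name="Length" type="string"/>
<element name="StartingLocation" type="string"/>
<element name="EndingLocation" type="string"/>
""")

river2_xsd = wrap("""
<include schemaLocation="River.xsd"/>
<import namespace="http://www.geodesy.org/dam" schemaLocation="Dam.xsd"/>
<element name="Dam">
    <complexType>
        <sequence>
            <any namespace="http://www.geodesy.org/dam" maxOccurs="unbounded"/>
        </sequence>
    </complexType>
</element>
""")

river3_xsd = wrap("""
<include schemaLocation="River.xsd"/>
<element name="Name" type="string"/>
""")

composed = compose([river_xsd, river2_xsd, river3_xsd])
vocabulary = [e.get("name") for e in composed.findall("{%s}element" % XSD)]
print(vocabulary)
# ['River', 'Length', 'StartingLocation', 'EndingLocation', 'Dam', 'Name']
```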

Aggregation Problem
The power of this design pattern is that it empowers each person to independently describe the Yangtze river
with no limits on the vocabulary (i.e., the XML tags) that is employed to describe the River (other than that the
vocabulary must all belong to the http://www.geodesy.org/river namespace).

With that power come potential problems when aggregating the disparate data. For example, suppose that two
people (independently) want to describe the river's Dam. They each create their own schema which declares a
Dam element. Everything is fine. Each person can separately validate their instance data. However, when an
aggregator tool scoops up all the instance data, and all the schema declarations, then there will be a problem -
the schema will have two (most likely different) declarations of Dam. That's a problem.

What's the solution? Here are some ideas that I have (I welcome other people's ideas on this):

1. Validate Separately, Well-Formedness Collectively: each person can individually validate their data.
When the data is aggregated don't bother with validation (what's the point?). Simply check for well-
formedness. I suspect in 99% of the cases this approach is perfectly fine.
2. Wrap Duplicates: when there are duplicates, such as duplicate Dam elements, wrap each of them within a
unique element. Do this both in the instance document as well as the schema. Obviously, this will require a
much more intelligent aggregator tool.
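
Either idea needs two small building blocks, sketched below in Python (standard library only): a well-formedness check for the aggregated instance data, and a way to find element names that more than one schema declares. The function names, the file names TeamA.xsd and TeamB.xsd, and the two sample schemas are hypothetical. The schema root element with the XML Schema namespace is also an assumption, since the fragments in this paper omit it.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XSD = "http://www.w3.org/2001/XMLSchema"


def is_well_formed(xml_text):
    """Return True if the text parses as XML (no validation is attempted)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def conflicting_declarations(schemas):
    """Map each element name declared by more than one schema to those schemas."""
    declared_in = defaultdict(list)
    for filename, text in schemas.items():
        for element in ET.fromstring(text).findall("{%s}element" % XSD):
            declared_in[element.get("name")].append(filename)
    return {name: files for name, files in declared_in.items() if len(files) > 1}


# Two hypothetical schemas whose authors each declared Dam, differently.
team_a = """
<schema xmlns="http://www.w3.org/2001/XMLSchema">
    <element name="Dam" type="string"/>
</schema>
"""

team_b = """
<schema xmlns="http://www.w3.org/2001/XMLSchema">
    <element name="Dam">
        <complexType>
            <sequence>
                <any namespace="http://www.geodesy.org/dam" maxOccurs="unbounded"/>
            </sequence>
        </complexType>
    </element>
    <element name="Name" type="string"/>
</schema>
"""

conflicts = conflicting_declarations({"TeamA.xsd": team_a, "TeamB.xsd": team_b})
print(conflicts)                               # {'Dam': ['TeamA.xsd', 'TeamB.xsd']}
print(is_well_formed("<River><Dam/></River>"))  # True
print(is_well_formed("<River><Dam></River>"))   # False
```

An aggregator following the first idea would stop at is_well_formed. One following the second would use the reported conflicts to decide which elements need a unique wrapper.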

Recommendation
As we have seen, this Web-oriented (eXtreme eXtensibility) XML Schema design pattern is useful whenever
there is a desire to grow a body of information, especially when you want to grow a body of data in a distributed
fashion. The Yangtze river example above is a case where, over time, you would like to grow a
body of data about the river. XML Schemas for geographic features are good candidates for this design pattern.
XML Schemas for collecting information about a person are also good candidates (for example, if you want to
create a biography of Albert Einstein you may want to "grow" this data in an independent, distributed fashion).
There are many other examples of where this design pattern is useful. Your imagination is truly the limit.

The goal of the Semantic Web is to enable anyone, anywhere, anytime to provide data about a resource. Using
the design pattern presented in this paper will bring your data in alignment with the vision of the Semantic Web.

Relation to RDF?
"This sure sounds a lot like RDF - identifying a resource (e.g., Yangtze river), providing data (property/value
pairs) for the resource. Why 'fake it' with XML Schemas? Why not use the 'real thing' ... RDF?"

Your observation is entirely correct. Ideally, we probably should use RDF. However, RDF is less well
understood by the XML community, it does not provide the same level of type checking that XML Schema
provides, and there is less tool support for it. So, the above approach is an interim solution. Using
this approach will help make the step to RDF (and the Semantic Web) less painful.

Unification of XML Schemas and RDF


The best of all possible worlds would be to make instance documents usable by both XML Schema tools and
RDF tools. Interestingly, two of the instance documents from the Yangtze river example are also perfectly
fine RDF documents:
<River id="Yangtze"
       xmlns="http://www.geodesy.org/river">
    <Length>6300 kilometers</Length>
    <StartingLocation>western China's Qinghai-Tibet Plateau</StartingLocation>
    <EndingLocation>East China Sea</EndingLocation>
</River>

as well as this one ...


<River id="Yangtze"
       xmlns="http://www.geodesy.org/river">
    <Name>Dri Chu - Female Yak River</Name>
    <Name>Tongtian He, Travelling-Through-the-Heavens River</Name>
    <Name>Jinsha Jiang, River of Golden Sand</Name>
</River>

The other instance document, however, is not good RDF:


<River id="Yangtze"
       xmlns="http://www.geodesy.org/river"
       xmlns:d="http://www.geodesy.org/dam">
    <Dam>
        <d:Name>The Three Gorges Dam</d:Name>
        <d:Width>1.5 miles</d:Width>
        <d:Height>610 feet</d:Height>
        <d:Cost>$30 billion</d:Cost>
    </Dam>
</River>

The reason is that RDF tries to treat the Dam element as a property. The Dam element is actually a class (and
Name, Width, Height, and Cost are properties of that class). The solution is to wrap the Dam element:
<River id="Yangtze"
       xmlns="http://www.geodesy.org/river"
       xmlns:d="http://www.geodesy.org/dam">
    <RiverStructure>
        <Dam>
            <d:Name>The Three Gorges Dam</d:Name>
            <d:Width>1.5 miles</d:Width>
            <d:Height>610 feet</d:Height>
            <d:Cost>$30 billion</d:Cost>
        </Dam>
    </RiverStructure>
</River>

Now this instance document is also fine RDF!
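
The wrapping step is mechanical, so a tool could apply it. Here is a minimal sketch in Python (standard library only) that wraps each Dam child of River in a RiverStructure element. The function name wrap_children is an assumption made for illustration, and the sketch assumes the wrapper belongs to the River namespace, as in the document above.

```python
import xml.etree.ElementTree as ET

RIVER_NS = "http://www.geodesy.org/river"


def wrap_children(parent, child_name, wrapper_name, namespace=RIVER_NS):
    """Wrap every direct child named `child_name` in a new `wrapper_name` element."""
    child_tag = "{%s}%s" % (namespace, child_name)
    wrapper_tag = "{%s}%s" % (namespace, wrapper_name)
    for index, child in enumerate(list(parent)):
        if child.tag == child_tag:
            wrapper = ET.Element(wrapper_tag)
            wrapper.append(child)
            parent[index] = wrapper  # the wrapper takes the child's position


dam_document = """
<River id="Yangtze" xmlns="http://www.geodesy.org/river"
       xmlns:d="http://www.geodesy.org/dam">
    <Dam>
        <d:Name>The Three Gorges Dam</d:Name>
        <d:Width>1.5 miles</d:Width>
        <d:Height>610 feet</d:Height>
        <d:Cost>$30 billion</d:Cost>
    </Dam>
</River>
"""

river = ET.fromstring(dam_document)
wrap_children(river, "Dam", "RiverStructure")
print(river[0].tag)     # {http://www.geodesy.org/river}RiverStructure
print(river[0][0].tag)  # {http://www.geodesy.org/river}Dam
```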

Since the world is moving towards a Semantic Web it makes very good sense to give your data dual use -
usable both by XML Schema tools and by RDF tools. (A friendly hint)

The Aggregation Problem Noted Above is not Present with RDF


In the above section on Aggregation I noted a problem with aggregating disparate schema declarations into a
single schema when there are multiple different declarations of the same element (the example I gave was of
two people independently declaring a Dam element). This is an inherent problem with XML Schemas.

RDF, however, does not have this limitation. Anyone, anywhere, and at any time can define the same properties,
and when they are aggregated there will not be a problem.
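
One way to see why: if each piece of data is viewed as a (resource, property, value) statement, aggregation is just a set union, and a statement made twice simply collapses into one. The Python sketch below uses plain tuples as a stand-in for RDF triples. It does not use an RDF library, and the names aggregate_statements, team_a, and team_b are assumptions made for illustration.

```python
def aggregate_statements(*sources):
    """Union any number of statement sets; repeated statements collapse into one."""
    aggregated = set()
    for statements in sources:
        aggregated |= statements
    return aggregated


# Each statement is a (resource, property, value) tuple standing in for an RDF triple.
team_a = {
    ("Yangtze", "Length", "6300 kilometers"),
    ("Yangtze", "EndingLocation", "East China Sea"),
}

team_b = {
    ("Yangtze", "Length", "6300 kilometers"),  # same statement, made independently
    ("Yangtze", "Name", "Jinsha Jiang, River of Golden Sand"),
}

aggregated = aggregate_statements(team_a, team_b)
print(len(aggregated))  # 3
```

Both teams used the Length property without coordinating, and nothing had to be reconciled when their data was combined.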

Acknowledgements
I would like to gratefully acknowledge the following people for their excellent insights, suggestions, and advice:

• Dave Case
• David Jacobs
• Frank Manola

http://www.xfront.com/eXtreme-eXtensibility.html 9/2/2011
