
Generic Model Management: A Database Infrastructure for Schema Manipulation

Philip A. Bernstein

Microsoft Corporation, One Microsoft Way
Redmond, WA 98052-6399 USA
philbe@microsoft.com

Abstract. This paper summarizes the model management approach to
generic schema management. The goal is to raise the level of abstraction
of data manipulation for programmers of design tools and other
model-driven applications. We explain the main concepts of model man-
agement, report on some recent progress on model matching algorithms,
and sketch a category-theoretic approach to a formal semantics for model
management.

1 Introduction
Many of the problems of cooperative information systems involve the design,
integration, and maintenance of complex application artifacts: application pro-
grams, databases, web sites, workflow scripts, formatted messages, user inter-
faces, etc. Engineers who perform this work use tools to manipulate formal de-
scriptions, or models, of these artifacts: object diagrams, database schemas, web
site layouts, control flow diagrams, XML schemas, form definitions, etc. The tools
are usually model-driven, in the sense that changes to the tool can be accom-
plished by updating the tool’s meta-model. For example, an entity-relationship
design tool may allow a user to add fields to the types Entity, Attribute, and
Relationship, and to add a function that uses those fields to generate SQL DDL.
This allows the user to customize the tool for database systems that have special
DDL capabilities not anticipated by the tool designer. Often, the run-time appli-
cation itself is model-driven. For example, a message-translator for business-to-
business e-commerce might be built as an interpreter of mappings that describe
transformations between types of messages. Changes to the message-translator
can be accomplished simply by modifying the mapping.
Model-driven tools and applications need a persistent representation of the
models that they manipulate. The database field’s last big push to meet the spe-
cial needs of these types of design applications was the development of object-
oriented database systems (OODBs) starting in the mid-to-late 1980s. Yet to-
day, most model-driven applications are built using relational databases and
files. This is rather odd, given that OODBs not only have the usual database
amenities, such as a query language and transactions, but also offer several

C. Batini et al. (Eds.): CoopIS 2001, LNCS 2172, pp. 1–6, 2001.

© Springer-Verlag Berlin Heidelberg 2001

valuable benefits over relational databases: a rich type system in which to de-
scribe models, smooth integration with object-oriented programming languages,
a versioning facility, and dynamically extensible types. These features are surely
helpful to developers of model-driven applications, among others, yet customer
acceptance of OODBs has been disappointing. The reason usually cited is the
difficulty of displacing a dominant technology like SQL database systems by a
less mature and robust one like OODBs. This may not, however, be the whole
story.
In our opinion, what is missing from today’s persistent storage systems are
features that offer an order-of-magnitude productivity gain for programmers of
model-driven applications over existing relational and OO databases, much like
relational database systems achieved over their predecessors for business-oriented
applications. Such an order-of-magnitude gain requires a leap of abstraction in
the programming interface. The opportunity to achieve this gain lies in the
fact that most model-driven applications include a lot of record-at-a-time or
object-at-a-time navigational code, even when a query language like SQL or
OQL is available. To avoid this navigational programming, we need a higher-
level data model, one with higher-level operations on higher-level objects that
would greatly reduce the amount of navigational code required by model-driven
applications. Our proposal for this higher level data model is summarized in the
next section.

2 Models and Mappings


In [4], we proposed an algebra that manipulates models and mappings, each of
which connects the elements of two models. The algebra is meant to be generic,
in the sense that it can be applied to models (i.e., schemas) expressed in many
different data models. Some of the key operations are:

– Match - which takes two models as input and returns a mapping between
them
– Compose - which takes two mappings as input and returns their composition
– Merge - which takes two models and a mapping between them as input and
returns a model that is the merge of the two models (using the mapping to
guide the merge)
– Set operations on models - union, intersection, difference
– Project and Select on models - comparable to relational algebra

To see how these operations might be used, consider the problem of popu-
lating a data warehouse. Suppose we have a model S1 of a data source and a
mapping map1W from S1 to a model W of a data warehouse. Now we are given
a model S2 of a second data source (see Figure 1). We can integrate S2 into
the data warehouse as follows: (1) Match S2 and S1, yielding a mapping map21;
(2) Compose map21 with map1W, to produce a mapping map2W from S2 to W.
Step (1) characterizes those parts of S2 that are the same as S1. Step (2) reuses
map1W by applying it to those parts of S2 that are the same as S1. In [5], we

described a detailed version of this scenario and others like it, to show how to
use the model management algebra to solve some practical data warehousing
problems.
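
The two steps of this scenario can be sketched in a few lines of Python. This is only an illustration under strong simplifying assumptions, not the algebra of [4]: a model is reduced to a set of element names, a mapping to a set of (source, target) pairs, Match to case-insensitive name equality, and Compose to relational composition. The element names and the functions match and compose are hypothetical.

```python
from typing import Set, Tuple

Model = Set[str]                 # assumption: a model is just a set of element names
Mapping = Set[Tuple[str, str]]   # assumption: a mapping is a set of (source, target) pairs


def match(m1: Model, m2: Model) -> Mapping:
    """Toy Match: pair elements whose names agree, ignoring case."""
    return {(a, b) for a in m1 for b in m2 if a.lower() == b.lower()}


def compose(first: Mapping, second: Mapping) -> Mapping:
    """Toy Compose: relational composition of two mappings."""
    return {(a, c) for (a, b1) in first for (b2, c) in second if b1 == b2}


# Hypothetical models for the warehouse scenario.
S1 = {"CustName", "CustCity", "OrderTotal"}
W = {"customer_name", "city", "revenue"}
map1W = {
    ("CustName", "customer_name"),
    ("CustCity", "city"),
    ("OrderTotal", "revenue"),
}

S2 = {"custname", "custcity", "Discount"}

map21 = match(S2, S1)           # step (1): parts of S2 that are the same as S1
map2W = compose(map21, map1W)   # step (2): reuse map1W for those parts
print(sorted(map2W))
```

In this sketch the element Discount of S2 has no counterpart in S1, so it does not appear in map2W. A designer would still have to map it to W by hand.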

Fig. 1. A Simple Data Warehouse Scenario Using Model Management

Even if one agrees that the above algebra is a plausible candidate for raising
the level of abstraction of meta-data management, there is still the question of
defining a simple and precise semantics for generic schema management opera-
tions. A panel on this question took place at VLDB 2000 [3]. On the whole, the
panelists found it worthwhile to pursue the goal of generic schema management.
However, there was some disagreement on how difficult it will be to define the
semantics of operations that can address the complexity of real-world data.
The examples in [5] were one step toward defining and validating an operational
semantics to handle this complexity. More recently, we have made progress
on two related fronts. They are described in the next two sections.

3 Schema Matching

To bootstrap the development of a model management implementation, we need
mappings. Hence, the first operation of interest is the one that generates
mappings, namely, Match.
It is implausible that one can develop a fully automatic implementation of
Match. The main reason is that most models are incomplete. That is, they have
a lot of semantics that is known or discoverable by human designers but is not
formally expressed in the model. Despite this inherent incompleteness, there is
still quite a lot of information that a match algorithm can use to produce a
draft mapping, for review by a human designer. Such information includes the
names of objects, their data types, their overall structure, example instances,
and auxiliary information about the domain of interest, such as a thesaurus and
glossary of acronyms.
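
As a rough illustration of how such information can produce a draft mapping, the following Python sketch scores pairs of elements by name similarity and data type. It is not the algorithm of any system cited here. The thesaurus entries, the weights, the threshold, and the example elements are all assumptions chosen for the example, and structure and instances are ignored.

```python
from difflib import SequenceMatcher

# Hypothetical thesaurus: each abbreviation maps to a canonical term.
THESAURUS = {"cust": "customer", "qty": "quantity", "amt": "amount", "no": "number"}


def normalize(name: str) -> str:
    """Lower-case a name, split on underscores, and expand known abbreviations."""
    tokens = name.lower().split("_")
    return " ".join(THESAURUS.get(t, t) for t in tokens)


def similarity(e1, e2) -> float:
    """Score two (name, data_type) elements in the range [0, 1]."""
    name_score = SequenceMatcher(None, normalize(e1[0]), normalize(e2[0])).ratio()
    type_score = 1.0 if e1[1] == e2[1] else 0.0
    return 0.8 * name_score + 0.2 * type_score   # the weights are arbitrary


def draft_match(model1, model2, threshold=0.7):
    """Return candidate correspondences for human review, best-scoring first."""
    candidates = [
        (e1[0], e2[0], round(similarity(e1, e2), 2))
        for e1 in model1
        for e2 in model2
    ]
    return sorted((c for c in candidates if c[2] >= threshold), key=lambda c: -c[2])


# Hypothetical models, each a list of (name, data_type) elements.
purchase_order = [("cust_no", "int"), ("qty", "int"), ("ship_date", "date")]
invoice = [("customer_number", "int"), ("quantity", "int"), ("total_amt", "decimal")]

for source, target, score in draft_match(purchase_order, invoice):
    print(source, "->", target, score)
```

The result is only a list of scored candidates. Elements that fall below the threshold, such as ship_date here, are left for the designer to resolve.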
There is a substantial research literature on match algorithms that exploit
this kind of information. We surveyed them in [9], where we developed a tax-
onomy of approaches. Using this taxonomy for guidance, we integrated many of

these approaches into a hybrid algorithm, called Cupid. Cupid uses properties
of individual elements, linguistic information of names, structural similarities,
key and referential constraints, and context-dependent matching. The algorithm
is described in [6], along with an experimental comparison to two earlier imple-
mentations, DIKE [8] and MOMIS [2], some of whose ideas were incorporated
into Cupid. While a truly robust solution is not around the corner, results so far
have been promising.
We have found that our implementation of Match is immediately useful to
some tools, even without implementations of the other model management op-
erations. For example, consider a graphical tool for defining a mapping between
XML message types. By incorporating Match, the tool can offer a user a draft
mapping, which the user can then review and refine. We expect that an im-
plementation of other operations will be similarly useful outside of a complete
model management implementation.

4 Formal Semantics
To achieve the goal of genericity, we need a semantics for model management
that is independent of data model. One approach is to define a rich data model
that includes all of the popular constructs present in data models of interest.
This would be an extended entity-relationship (EER) model of some kind, which
includes an is-a hierarchy, complex types, and the usual kinds of built-in con-
straints, such as keys and referential integrity. This is the approach used in the
Cupid prototype. Models expressed in some other data model, such as SQL DDL
or XML schema definitions (XSD [10]), are imported into the native Cupid data
model before they are matched. The resulting mapping is then exported back
into the data model of interest, such as one that can be displayed in a design
tool (e.g., for SQL or XSD). That way, Cupid can support many different data
models and, hence, many different applications.
It would be desirable to have a carefully developed mathematical semantics
that shows that the EER data model used by Cupid, or one like it, is generic.
One way to go about this would be to demonstrate a translation of each data
model of interest into the chosen EER model and analyze properties of that
translation, such as how much semantics is preserved and lost. This approach is
likely to be enlightening, but it remains future work.
A second approach is to analyze models and mappings in the abstract, using
category theory [7]. A category is an abstract mathematical structure consist-
ing of a set of uninterpreted objects and morphisms (i.e. transformations) be-
tween them. To apply it to models and mappings, one represents the well formed
schemas of a data model as a category, whose internal structure is uninterpreted.
Models are objects of a schema category, and mappings between models are mor-
phisms between objects of a schema category. A mapping between two categories
is called a functor. A functor can be used to provide an interpretation of schemas;
the functor maps each schema in a schema category to a category of instances,
which is the set of all databases that conform to the schema. One can then use

some standard constructions in category theory to study the effects of combining
mappings in certain ways. For example, consider two mappings that integrate
each of two data source schemas into a target schema. This construction resembles
the category-theoretic concept of “pushout,” which in this case is a minimal
target schema that is, in effect, a subschema of any other integration of the two
data sources.
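
To make the pushout idea concrete, here is a minimal Python sketch in the category of finite sets rather than in a schema category. It assumes that each source schema is just a set of element names A and B, and that the overlap between them is given by a third set C with functions f from C to A and g from C to B. The pushout is the disjoint union of A and B with f(c) and g(c) identified for every c in C. The function pushout and all element names are hypothetical, and the internal structure and constraints of real schemas are ignored.

```python
def pushout(A, B, f, g):
    """Pushout of finite sets for the span A <-f- C -g-> B.

    f and g are dicts with the same keys (the elements of C). Returns the
    target set and the two mappings from A and from B into it.
    """
    # Tag elements so that equal names in A and B stay distinct until glued.
    parent = {("A", a): ("A", a) for a in A}
    parent.update({("B", b): ("B", b) for b in B})

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for c in f:
        parent[find(("A", f[c]))] = find(("B", g[c]))   # identify f(c) with g(c)

    # Each equivalence class becomes one element of the target.
    classes = {}
    for x in parent:
        classes.setdefault(find(x), set()).add(x)
    name = {root: frozenset(members) for root, members in classes.items()}

    from_A = {a: name[find(("A", a))] for a in A}
    from_B = {b: name[find(("B", b))] for b in B}
    return set(name.values()), from_A, from_B


# Hypothetical source schemas and their declared overlap.
A = {"emp_id", "emp_name", "salary"}
B = {"id", "name", "dept"}
f = {"key": "emp_id", "label": "emp_name"}
g = {"key": "id", "label": "name"}

target, from_A, from_B = pushout(A, B, f, g)
print(len(target))
```

The two returned mappings agree on the shared elements, which is the commuting condition of the pushout square. Elements outside the overlap, such as salary and dept, are carried into the target unchanged.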
We have used this categorical approach to study the effect of schema inte-
gration on integrity constraints [1]. The concept of an expression language for
constraints is defined by an uninterpreted functor that maps each schema to
the set of well-formed formulas over that schema (i.e., constraints that use the
schema’s objects as terms in the formulas). The set of well-formed formulas is
just a set of sentences. At the level of category theory, they’re uninterpreted.
This allows us to talk about constraint satisfaction in general, independent of
the logic that is chosen to express the constraints. Using this abstract model
of constraints, we can characterize the properties of a constraint-preserving
integration of data source schemas and their integrity constraints into a target
schema with its integrity constraints.
Category theory gives the “shape” of the model management theory. That is,
it explains the main theorems to be proved for any given data model. To make
that theory useful, one has to apply its results to specific categories of interest,
such as relational, OO, or XML schemas plus an associated constraint logic,
and to specific types of schema transformations. That application of the theory
uses the additional structure of the schema types and constraints of interest to
drive through the proofs of those theorems. We did this for a simple OO schema
language and simple transformations. It would be valuable to extend this to
other data models and more complex kinds of transformations.
This study of schema integration serves as a template for using the categorical
approach to study other schema mapping problems, such as database evolution
and migration, and to study the semantics of transformations that correspond
to model management operations. It also demonstrates a tangible benefit of the
categorical approach, namely a generic formal study of constraints in schema
mapping, the first such treatment we know of.

5 Conclusion

We have briefly reviewed the model management approach to generic schema
management and summarized some recent progress in the areas of model matching
and formal semantics. These are steps along what we expect to be a very
long research path. Other topics worth investigating are the semantics and
implementation of Merge and Compose, properties of translations from standard
data models into a generic EER one, and the development of detailed examples
that apply model management to practical problems.

Acknowledgments

The author is very grateful to the following researchers whose work is sum-
marized in this paper: Suad Alagic, Alon Halevy, Jayant Madhavan, Rachel
Pottinger, and Erhard Rahm.

References
1. Alagic, S. and P. A. Bernstein: “A Model Theory for Generic Schema
Management,” submitted for publication.
2. Bergamaschi, S., S. Castano, and M. Vincini: “Semantic Integration of
Semistructured and Structured Data Sources.” SIGMOD Record 28, 1 (Mar. 1999), 54-59.
3. Bernstein, P. A.: “Panel: Is Generic Metadata Management Feasible?” Proc. VLDB
2000, pp. 660-662. Panelists’ slides available at
http://www.research.microsoft.com/~philbe.
4. Bernstein, P. A., A. Halevy, and R. A. Pottinger: “A Vision for Management of
Complex Models.” ACM SIGMOD Record 29, 4 (Dec. 2000).
5. Bernstein, P. A. and E. Rahm: “Data Warehouse Scenarios for Model
Management,” Conceptual Modelling - ER2000, LNCS 1920, Springer-Verlag, pp. 1-15.
6. Madhavan, J., P. A. Bernstein, and E. Rahm: “Generic Schema Matching Using
Cupid,” VLDB 2001, to appear.
7. Mac Lane, S.: Categories for the Working Mathematician, Springer, 1998.
8. Palopoli, L., G. Terracina, and D. Ursino: “The System DIKE: Towards the
Semi-Automatic Synthesis of Cooperative Information Systems and Data Warehouses.”
ADBIS-DASFAA 2000, Matfyzpress, 108-117.
9. Rahm, E., and P. A. Bernstein: “On Matching Schemas Automatically,” Microsoft
Research Technical Report MSR-TR-2001-17, Feb. 2001.
10. W3C: XML Schema, http://www.w3c.org/XML/schema, 2001.