GIS06

DATABASE MANAGEMENT
Last day we looked at spatial data structures for both vector and raster data models. When working with large
amounts of data, it is important to have good procedures for managing the data. In the past this involved working
with data organised into files; today the major GIS systems use data organised into a database managed by a
database management system (DBMS). The aim of a database management system, according to Burrough and
McDonnell (1998, 50), is 'to make data quickly available to a multitude of users whilst still maintaining its integrity; to
protect the data against deletion and corruption; and to facilitate the addition, removal and updating of data as
necessary.'
We will begin by looking briefly at some of the file formats used in the past before looking at some of the issues
associated with using a DBMS with spatial data.
FILES
In the past, GIS and related technologies (such as CAD) used data organised into files. The files used by any one
application usually had a specific format, which in many cases could not be read by other systems. This created
obvious problems with regard to the reusability of data. Over the course of time various standard formats emerged,
either de facto due to the popularity of particular systems or more formally as a result of definition by government
agencies or international standards committees. GIS applications software usually provided facilities to convert the
data in these standard formats into their own proprietary format. This provided a partial solution to the problem of
reusability, in that the data could be transferred from one system to another, but it was an incomplete solution because
the translation process imposed a computing overhead: data that were frequently changed or updated had to be
re-translated after every change.
The solution for many organisations was to store the data in a database managed by a DBMS. Rather than accessing
the data directly, a GIS could now access the data through a DBMS. The major GIS were designed to work with
different DBMS, whilst each DBMS could be accessed by different GIS. This meant the data managed by the DBMS
were no longer limited to a specific GIS application.
DBMS have not rendered file organised data totally redundant. Several file formats are still widely used for data
exchange (i.e. movement between GIS systems). Also, for small simple projects, it may be more cost effective to
work with data organised in files rather than stored in a database.
File Formats
Some of the more important file formats (either historically or in use today) are listed below:
DIME (Dual Independent Map Encoding). This was an encoding scheme developed by the U.S. Bureau of the
Census in the late 1960s to facilitate the storage of street map data to be used in collecting the census. The actual file
format for DIME files was known as GBF (Geographic Base Files). It is no longer used, but it was instrumental in
the early development of topological structures. The 'dual' part of its title refers to the fact that each arc contained
information on the blocks (i.e. polygons) on either side and the to and from node numbers. This duplication (or
redundancy) of topological information facilitates consistency checking.
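The consistency checking that this redundancy allows can be sketched as follows. This is a minimal illustration only: the record layout, field names and sample arcs are hypothetical and are not the actual GBF/DIME format. The idea is that, for the arcs bounding a block, each node should be left as often as it is entered once the arcs are oriented with the block on the same side, which is a necessary condition for the boundary to close.

```python
from collections import namedtuple

# Hypothetical arc record: a node number at each end and a block id on each side.
Arc = namedtuple("Arc", "arc_id from_node to_node left_block right_block")

arcs = [
    Arc(1, 10, 11, "A", "B"),
    Arc(2, 11, 12, "A", "C"),
    Arc(3, 12, 10, "A", "D"),
]


def block_is_closed(block, arcs):
    """Return True if every node on the block's boundary is entered as often
    as it is left (a necessary condition for a closed boundary)."""
    starts, ends = [], []
    for arc in arcs:
        if arc.left_block == block:
            starts.append(arc.from_node)
            ends.append(arc.to_node)
        elif arc.right_block == block:
            # Reverse the arc so that the block is always on the left.
            starts.append(arc.to_node)
            ends.append(arc.from_node)
    return bool(starts) and sorted(starts) == sorted(ends)


print(block_is_closed("A", arcs))  # block A is bounded by all three arcs
print(block_is_closed("B", arcs))  # block B has only one of its arcs recorded
```

In this toy dataset block A passes the check, whereas block B fails because only one of its bounding arcs is present.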
TIGER (Topologically Integrated Geographic Encoding and Referencing). The US Census Bureau replaced DIME
with TIGER in the 1980s, in preparation for the 1990 census. TIGER files use a more complex topological structure that allows
information to be directly addressed and therefore retrieved more speedily without the need for lengthy sequential
searches. It is still used today. The proprietary file format is known as TIGER/Line, but TIGER data were also made
available in 2008 as shapefiles. It is proposed to use GML (Geography Markup Language) in the future.
DLG (Digital Line Graph). DLGs are distributed by the US Geological Survey. They capture data from maps at a
variety of medium scales (1:20,000, 1:24,000, 1:25,000, 1:62,500, 1:100,000) and small scales (1:2,000,000).
Locational data are recorded as UTM coordinates rounded to the nearest 10 metres. The DLG format includes
topological information. DLG data are distributed in various formats, including SDTS (Spatial Data Transfer
Standard).
DXF (Drawing eXchange Format). This format was devised in 1982 to export data from the CAD package
AutoCAD, although AutoCAD uses a different format (DWG) for its own internal use. DXF files are simple ASCII
files with a large header containing large amounts of metadata. Ordnance Survey Ireland traditionally provided data
in DXF format. However, although DXF organises the data in separate layers, DXF does not include any topology.
DXF files can be read by MapInfo, ArcView, ArcGIS and various other GIS, and layers can even be separated out
and saved as a shapefile (see below). However, because these files do not contain any topology, you generally need
to use ArcInfo (or something similar) to create the topology (e.g. join the arcs into polygons) before the data can be
used in a non-trivial way within a GIS (see footnote 1).
Arc/Info Coverages. Arc/Info coverages were introduced in Arc/Info 1 in 1982. They combine both vector and
attribute data (the arcs and the info). The vector data contain topological information. Each coverage is stored in
a directory, and each feature class (i.e. a set of points, lines or polygons) is saved in a file. A coverage may have
more than one feature class. Annotation, tic marks etc. are also saved as separate files. Computers were much less
powerful in the 1980s, so the data are saved in indexed binary files to speed up access times.
ArcView Shapefiles. Shapefiles were introduced by ESRI in ArcView 2 in the early 1990s. Each shapefile comprises a
minimum of three files: one is a direct access file, with variable length records, to store the geometrical shapes of the
objects (.shp); a second stores non-spatial data tables in a standard dBase format (.dbf); and a third stores an index
of the position of each feature record within the .shp file (.shx). Features are matched to their rows in the .dbf table
by record order. Additional files may be created during processing.
Shapefiles do not contain topological information. Rather they use an object-based design in which each feature is
treated as a separate object that can be linked to attribute data. Shapefiles require less space and can be processed
more quickly than their topological equivalents. For GIS purposes, the shapefiles normally need to be joined to attribute
data stored in separate dbf files, although the attribute data may also be saved as part of the shapefile dbf file.
DATABASES
A database may be defined as: a unified computer-based collection of data, shared by authorised users, with the
capability for controlled definition, access, retrieval, manipulation, and presentation of data (Worboys and
Duckham, 2004, 39). The database (i.e. the data collection) is managed by a database management system (DBMS).
Individual users do not directly access the data; rather they interact with the data through the DBMS. Thus, for
example, they might send a request for information using a high level query language, such as SQL (Structured
Query Language). The DBMS translates the query into the low level terms required to process and retrieve the
information.
When working with large volumes of data, which may need to be accessible by large numbers of different people, it
is important to design the database to minimise redundancy (i.e. unnecessary duplication of information), whilst
maintaining consistency (e.g. if the data values for an entity are changed, then the same changes should be consistent
throughout the system). The system must also be able to guarantee security (i.e. provide limits on who can read what
types of data, and who can update different types of data), protect against data loss caused by system failures (e.g. by
making regular backups) and be able to manage concurrency (i.e. handle what happens if several people try to access
the same data at the same time). Decisions must be made as to whether the database should be distributed (i.e.
physically located in different sites) or not. When designing a large database it is also important to try to ensure there
is sufficient flexibility to adjust to unforeseen contingencies (e.g. the need to add new types of data). This generally
involves trying to separate the procedures for querying the data from the structures used to store the data (data
independence). This is one of the key features of a database: users interact with a representation of the data
which is independent of the actual physical storage, and the DBMS translates the user's instructions into operations
on the actual data.
Different levels of abstraction can be identified in the design of a database:

1. Conceptual model. This is the highest level of abstraction and it basically involves trying to identify the entities
(or objects) and attributes for which data are required, and the relationships between them. Jones (1997)
explains in some detail some of the techniques that may be used to develop conceptual models (e.g. entity
relationship modelling, semantic modelling, etc.).

2. Logical (or internal) model. The logical model refers to the data structures and retrieval mechanisms used for
managing the data. Most database management systems use one of a small number of logical models (see
below).
3. Physical model. This refers to the precise organisation of the data inside the computer (e.g. the number of bytes
allocated to each data item, their physical addresses, etc.). This generally does not concern us too much because
it is managed by the DBMS and is hidden from the end-user.

Footnote 1: OSI also provides data in DWG format, but this creates exactly the same problems.
The conceptual schema within a database (e.g. the entities and their properties) is defined using a data definition
language (DDL). The data are manipulated (i.e. edited, interrogated) through a data manipulation language
(DML) - i.e. the user can send queries using a high level language which the DBMS translates into the terms required
by the lower level physical model.
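The distinction between DDL and DML can be illustrated with a short sketch using Python's built-in sqlite3 module. The table, column names and figures are invented for illustration. The CREATE TABLE statement is DDL; the INSERT, UPDATE and SELECT statements are DML. At no point does the user say anything about how the rows are physically stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# DDL: define an entity (town) and its properties.
conn.execute(
    """
    CREATE TABLE town (
        town_id    INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        population INTEGER
    )
    """
)

# DML: add, edit and interrogate the data (all values are made up).
conn.executemany(
    "INSERT INTO town (town_id, name, population) VALUES (?, ?, ?)",
    [(1, "Town A", 12000), (2, "Town B", 5000), (3, "Town C", 15000)],
)
conn.execute("UPDATE town SET population = 15500 WHERE name = 'Town C'")

large_towns = conn.execute(
    "SELECT name FROM town WHERE population > 10000 ORDER BY name"
).fetchall()
print(large_towns)
```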
Older Logical Models
Originally the three most widely used logical models were the hierarchical, network and relational models.
Hierarchical models assumed that the entities are organised in a tree-like hierarchical manner (e.g. Fig b). They tend
to support quick searches, but may result in a lot of data redundancy. Network models can accommodate more
complex structures using pointers between entities (Fig c). This reduces data redundancy, but may result in very
large overheads maintaining the pointers.
Within a relational database, data are logically organised into tables (called relations) containing rows and
columns. Each row (or tuple) corresponds to a particular entity e.g. person, place or thing. Each of the columns
contains a field or attribute i.e. the data values for a particular variable attribute for that particular entity.
One of the main advantages of relational databases is that they are very flexible because information from different
tables can be joined by fields containing a unique key. This permits one-to-many and many-to-one joins and
eliminates the need for pointers. The key fields must be repeated in each linked table, which introduces a degree of
data redundancy, although splitting the data into linked tables can also considerably
reduce redundancy overall. Search times also tend to be slower than in the other two models. Nevertheless, the
relational model emerged as the dominant database model, to the extent that many people now assume that all
databases are relational. Each table is usually saved in a different file, although MS Access is an exception as it keeps
all the tables in a single file.
The DBMS for a relational database is usually referred to as an RDBMS (i.e. Relational DBMS).
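A one-to-many join through a key field might look like the following sketch, again using Python's sqlite3 module. The county and town tables and their contents are hypothetical. Note that county_id is the only value repeated in the town table; the county name itself is stored once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE county (
        county_id   INTEGER PRIMARY KEY,
        county_name TEXT
    );
    CREATE TABLE town (
        town_id   INTEGER PRIMARY KEY,
        town_name TEXT,
        county_id INTEGER REFERENCES county(county_id)  -- key used for joins
    );
    INSERT INTO county VALUES (1, 'County X'), (2, 'County Y');
    INSERT INTO town VALUES (10, 'Town A', 1), (11, 'Town B', 1), (12, 'Town C', 2);
    """
)

# One county row is matched to many town rows through the shared key.
rows = conn.execute(
    """
    SELECT county.county_name, town.town_name
    FROM county
    JOIN town ON town.county_id = county.county_id
    ORDER BY county.county_name, town.town_name
    """
).fetchall()

for county_name, town_name in rows:
    print(county_name, town_name)
```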
SQL (Structured Query Language) emerged as the standard query language for relational databases. SQL provides
both DDL and DML functionality.
Normalisation is the name given to the systematic elimination of data redundancy, anomalies and other undesirable
characteristics.
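As a rough illustration of what normalisation achieves (a sketch in plain Python with invented data, not a full treatment of the normal forms), consider a flat table in which the owner's details are repeated for every parcel. Splitting it into two tables linked by owner_id means each owner's details are stored once, so an update cannot leave inconsistent copies behind.

```python
# Flat (unnormalised) table: owner details are repeated on every parcel row.
flat = [
    {"parcel_id": 1, "owner_id": 7, "owner_name": "Owner P", "owner_phone": "555-0100"},
    {"parcel_id": 2, "owner_id": 7, "owner_name": "Owner P", "owner_phone": "555-0100"},
    {"parcel_id": 3, "owner_id": 8, "owner_name": "Owner Q", "owner_phone": "555-0101"},
]

# Normalised form: one table per entity, linked by the owner_id key.
owners = {}
parcels = []
for row in flat:
    owners[row["owner_id"]] = {
        "owner_name": row["owner_name"],
        "owner_phone": row["owner_phone"],
    }
    parcels.append({"parcel_id": row["parcel_id"], "owner_id": row["owner_id"]})

# The phone number now needs to be changed in one place only.
owners[7]["owner_phone"] = "555-0199"

# Re-joining the tables shows the change on every parcel owned by owner 7.
joined = [{**parcel, **owners[parcel["owner_id"]]} for parcel in parcels]
print(joined)
```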
Tables in a relational database have various characteristics:
1. The tuples in a relation are all distinct from one another;
2. Each tuple has the same number of attributes (i.e. columns);
3. The ordering of the tuples in a relation has no significance;
4. The ordering of the columns in a relation has no significance (although the data values must be in the correct
column);
5. Each cell contains only one data value (as opposed to a set, array or list).
Newer Logical Models
The three traditional models have been supplemented in more recent years by a few other models. Deductive
databases incorporate mechanisms that use rules to deduce additional facts or relationships from the stored data,
thereby incorporating facilities normally found in expert systems. It is thought that this could provide new
approaches for handling uncertainty and measurement errors in GIS, but they have not been adopted in GIS to any
significant extent.
Object-orientated databases model entities as objects in much the same way as do object orientated programming
languages. Each object not only contains attributes which define the state of that particular entity, but also the
methods (i.e. specialised procedures) for doing various things. For example, in a GIS context a spatial object class
might include methods for testing if selected objects overlap or intersect, or if they are adjacent to one another.
Groups of similar objects are known as object classes, and individual objects are instances of these classes. The data
values themselves are objects. Inheritance is supported i.e. object classes may inherit the properties, data attributes
and methods of other classes. Encapsulation means the methods of an object are invoked
through the interface to the object without knowledge of its internal workings, so the implementation can be changed independently. Objects can send requests to other
objects. Object databases are managed by an ODBMS and can be queried by OQL (Object Query Language).
However, unlike SQL, OQL is used as an extension to a standard object orientated programming language, such as
C++ or Java, and therefore provides a seamless integration - i.e. the objects can be manipulated by either the host
language or OQL without the need to transfer data from one to the other.
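The ideas of state, methods and inheritance can be sketched in a few lines of Python. This is an illustration of the object-orientated concepts only, not of any particular ODBMS or of OQL. The class names are hypothetical, and each object's geometry is simplified to a bounding rectangle so that the intersection test stays short.

```python
class SpatialObject:
    """A spatial object whose geometry is simplified to a bounding rectangle."""

    def __init__(self, name, xmin, ymin, xmax, ymax):
        # Attributes defining the state of this particular object.
        self.name = name
        self.xmin, self.ymin = xmin, ymin
        self.xmax, self.ymax = xmax, ymax

    def intersects(self, other):
        """Method: do the two bounding rectangles overlap or touch?"""
        return (
            self.xmin <= other.xmax and other.xmin <= self.xmax
            and self.ymin <= other.ymax and other.ymin <= self.ymax
        )


class Parcel(SpatialObject):
    """Inherits the attributes and methods of SpatialObject and adds its own."""

    def __init__(self, name, xmin, ymin, xmax, ymax, owner):
        super().__init__(name, xmin, ymin, xmax, ymax)
        self.owner = owner

    def area(self):
        return (self.xmax - self.xmin) * (self.ymax - self.ymin)


# Instances of the Parcel class (coordinates and owners are made up).
a = Parcel("a", 0, 0, 10, 10, "Owner P")
b = Parcel("b", 5, 5, 15, 15, "Owner Q")
c = Parcel("c", 20, 20, 30, 30, "Owner R")

print(a.intersects(b), a.intersects(c), a.area())
```

Callers use intersects() and area() without needing to know how the coordinates are held internally, which is the point of encapsulation.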
Object orientated databases are still relatively uncommon, but object orientated principles have been incorporated
into relational DBMS by some of the major database vendors (e.g. Oracle, IBM's DB2) to form hybrid systems
known as object-relational database management systems (ORDBMS). Relations (i.e. tables) are still a major
feature of these systems, but the cells in the tables can be used to store complex objects that encapsulate both data
values and methods (as opposed to a single data value). SQL has been extended to support these more complex data
structures as user-defined abstract data types (ADTs).
DATA MANAGEMENT IN GIS
Conventional relational database management systems (RDBMS) are not really suitable for managing spatial data,
because spatial data have special requirements. For example, if coding the location of a polyline as a series of x,y
grid coordinate pairs, it is essential that the points are stored and retrieved in a particular sequence. Conventional
relational database systems do not impose any restrictions upon the sequence of the tuples (rule 3 above) in a table,
so the storage of vector coordinate data, one row per coordinate pair, in a relational database becomes problematic.
Alternatively, storing all the coordinates for a polyline in a single tuple becomes problematic because the number of
co-ordinate pairs may vary considerably from one line to another, resulting in varying numbers of columns (breaking
rule 2), whilst the columns would need to be in a specific sequence (breaking rule 4).
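The ordering problem can be demonstrated with a small sketch using Python's sqlite3 module. One common workaround, shown here as an assumption rather than as something prescribed by these notes, is to store the vertex order explicitly in an extra column (called seq below) and to ask for the rows in that order whenever the line is retrieved. The table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE vertex (
        line_id INTEGER,
        seq     INTEGER,   -- explicit position of the vertex along the line
        x       REAL,
        y       REAL,
        PRIMARY KEY (line_id, seq)
    )
    """
)

# Rows are deliberately inserted out of order: the table itself has no
# inherent row sequence that could be relied upon.
conn.executemany(
    "INSERT INTO vertex VALUES (?, ?, ?, ?)",
    [(1, 2, 20.0, 5.0), (1, 0, 0.0, 0.0), (1, 1, 10.0, 5.0)],
)

# The polyline is only recovered correctly because the order is requested.
polyline = conn.execute(
    "SELECT x, y FROM vertex WHERE line_id = ? ORDER BY seq", (1,)
).fetchall()
print(polyline)
```

Even with this workaround, every retrieval of a line involves a sort and many small rows, which helps explain why conventional RDBMS were seen as a poor fit for coordinate data.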
Closer examination of the relational structure depicted in the diagram above indicates that it is actually incompatible
with a conventional relational database e.g. the arc-node (i.e. normalised) structure on the right potentially has a
variable number of columns per row in both the polygon and line tables, whilst the sequence of the columns in both
tables is important. The same is true of the polygon table in the segment-vertex structure on the left.
Georelational Models
A lot of the data which is stored in a GIS is non-spatial attribute data (see footnote 2). Given that relational database management
systems (RDBMS) have evolved very sophisticated methods for handling non-spatial data, there is clearly a strong
argument for using conventional relational database management systems for handling attribute data and other data
structures (such as those discussed last day) for managing the spatial data. As a result many Geographical
Information Systems traditionally used a hybrid georelational model.
The hybrid georelational systems offer a number of advantages over alternative approaches due to the fact that the
spatial and attribute data can be stored in different physical locations and be managed separately - i.e. the attribute
data can be edited without having to modify the spatial data, the management of attribute data can take full
advantage of the standard methods and new developments in RDBMS, and the ways in which the spatial data are
represented can be changed without any adverse implications for the management of attribute data.
Burrough and McDonnell discuss four hybrid structures:
1. Arc/Node - RDBMS. This uses a topological arc/node data structure for the geographical features, supporting
full vector mode topology, with links to attribute data in a relational database. One limitation is that if features
on different layers are superimposed to create new features, it is necessary to build new records in the RDBMS.
This is the most widely used hybrid structure.
Arc/Info was one of the earliest programs to use this type of structure. Arc/Info coverages, introduced in the
early 1980s, consisted of two components: spatial data organised using an arc/node structure incorporating
topological information (the Arc) and attribute data managed by Arc/Info's own RDBMS (Info).
2. Object - RDBMS. The spatial features are treated as discrete spatial objects, whilst the attribute data are
managed in a linked RDBMS. This was the model used by ArcView up to version 3.2 (i.e. shapefiles, etc.).
However, the objects are simply shapes defined using geometric primitives (points, lines, polygons), rather than
objects in the normal computer sense (i.e. with properties, methods, etc.). The spatial features of the objects can
be edited using tools provided by the GIS, whereas the attribute data would normally be managed in an RDBMS.
3. Compact Raster - RDBMS. This is very similar except the geographical features (e.g. polygons representing
administrative areas) are defined in raster mode. Rather than containing data values, the cells in the raster
contain a feature identifier which is used to link to a relational database. Given that all cells for the same feature
have the same id code, this type of raster is generally suitable for data compaction (e.g. using run-length
encoding). The Database Workshop option in Idrisi uses this type of structure.
4. Quadtree - RDBMS. This is similar to the previous situation, except that the spatial features are coded as a
quadtree. Burrough and McDonnell note that quadtrees allow the intersection of features at different levels of
spatial resolution. This can be easily handled in vector mode, but only with difficulty in other types of raster
mode.

Footnote 2: Some people argue that attribute data in a GIS are 'spatial' in so far as they represent the properties of spatial
entities, but they are 'non-spatial' in the sense that they do not contain locational information (e.g. grid coordinates).
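The compact raster structure (item 3 above) can be sketched in plain Python. The raster, the feature identifiers and the attribute values are all invented, and the attribute table is represented by a dictionary standing in for a table in an RDBMS. Because neighbouring cells often share the same feature identifier, each row compresses well under run-length encoding.

```python
# Each cell holds a feature identifier, not a data value.
raster = [
    [1, 1, 1, 2, 2],
    [1, 1, 2, 2, 2],
    [3, 3, 3, 3, 2],
]

# Stand-in for the linked attribute table, keyed on the feature identifier.
attributes = {
    1: {"name": "Area A", "population": 1200},
    2: {"name": "Area B", "population": 800},
    3: {"name": "Area C", "population": 450},
}


def run_length_encode(row):
    """Compress a row into (value, run length) pairs."""
    runs = []
    for value in row:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]


def attribute_at(row, col, field):
    """Look up an attribute for the feature occupying a given cell."""
    return attributes[raster[row][col]][field]


encoded = [run_length_encode(row) for row in raster]
print(encoded)
print(attribute_at(2, 0, "name"))
```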
Field Data Models
Data conceptualised as field data present different problems to data conceptualised as objects. As noted previously,
field data can be modelled in either raster mode or vector mode (using isolines or TINs). However, irrespective of
the method used, there is only one attribute (e.g. rainfall, pollution levels) per layer and the attribute data are
integrated with the locational data. In raster mode the cells contain the attribute value at the location indicated by the
position of the cell in the raster. Using isolines, the location of the isolines connects places having the same attribute
value, whilst using TINs places having a similar slope and aspect are enclosed within the same triangle. In other
words, irrespective of the approach adopted, there is no need for attribute data to be joined to the locational data.
Data storage within a raster-mode GIS is therefore simpler than in a vector-mode GIS using data conceptualised as
objects. Each attribute you wish to include is simply included as an additional raster. Each raster layer is usually
saved in a different file, so the system can load in the appropriate layer as required. This is the model used, for
example, by Idrisi. Despite its simplicity, this structure can facilitate very powerful forms of analysis and information
retrieval, provided the rasters are aligned, overlap and have the same resolution. Discrete data (e.g. land-use
categories coded by nominal scale data values) can be handled in the same way without any great difficulty.
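A simple overlay of two aligned raster layers might look like the sketch below, written in plain Python with nested lists standing in for raster files. The layer names, cell values and selection criteria are invented. The analysis works cell by cell, which is why the layers must cover the same area at the same resolution.

```python
# Two aligned layers with the same number of rows and columns.
rainfall = [
    [700, 850, 900],
    [800, 950, 1000],
    [900, 1050, 1200],
]
landuse = [  # nominal codes, e.g. 1 = pasture, 2 = forest, 3 = urban (hypothetical)
    [1, 1, 2],
    [1, 2, 2],
    [3, 1, 1],
]


def same_shape(a, b):
    """Check that two rasters have identical dimensions."""
    return len(a) == len(b) and all(len(ra) == len(rb) for ra, rb in zip(a, b))


if not same_shape(rainfall, landuse):
    raise ValueError("layers must be aligned and have the same resolution")

# New layer: 1 where rainfall is at least 900 AND the land use code is 1.
suitable = [
    [1 if rain >= 900 and use == 1 else 0 for rain, use in zip(rain_row, use_row)]
    for rain_row, use_row in zip(rainfall, landuse)
]
print(suitable)
```

The result is itself a raster layer, which could be saved as a further file and used in later overlays.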
Object Relational Models
Georelational models do not permit management of the locational data by an RDBMS. Similar constraints apply to
field data models. Given the desirability of managing data using a DBMS, especially in the context of a large
organisation, recent years have seen a move away from georelational models as developments in processing power,
the availability of storage space and database technology have made it possible to define more complex user-defined
spatial data types which can be managed using an ORDBMS.
Many of the major enterprise DBMS, such as IBM's DB2 and Informix, Microsoft's SQL Server, and Oracle, are
ORDBMS. They now provide spatial extensions (DB2 Spatial Extender, Informix Spatial DataBlade, Oracle
Spatial), whilst the open source DBMS PostgreSQL supports spatial types and functions through the PostGIS extension. ESRI moved
towards support for object-relational models with the introduction of its geodatabase model in ArcGIS 8 (see
below).
THE GEODATABASE MODEL
ArcGIS 9 can still be used with traditional file based data such as coverages, shapefiles, grids, images and TINs.
However it can also be used with an object-relational data model introduced by ESRI in ArcGIS 8 called a
geodatabase. This combines the advantages of object-orientated approaches with those of a traditional RDBMS.
A geodatabase is a collection of one or more geographic datasets of various types. There are three primary dataset
types:
1. Feature classes. A table containing one row for each feature of a particular type (e.g. point, line, polygon) with
columns containing attributes of those features. One column contains a unique feature ID, whilst another (called
shape) contains its locational details.
2. Raster datasets.
3. Tables. These are attribute tables, which are much the same as a feature class without a shape field.
Related feature classes are sometimes grouped together to form a feature dataset. There are several possible reasons
to do this, including: adding a topology; organising into a network; building a terrain dataset (i.e. TIN); or simply to
group datasets for convenience, controlling access or ease of sharing.
A geodatabase can contain different types of feature class, rasters, TINs and locators (e.g. addresses). Thus, all of
your geographical data can be stored and managed in a single database (which, in turn, may be stored in a single
file).
In a georelational data model the spatial and attribute details are contained in separate files (e.g. shape and dbf files)
with unique identifiers used to join the data. In a geodatabase a feature class can store the spatial details and the
attribute data in the same table. Shapes are held in a field either as a binary large object (BLOB) or as one of the
extended spatial types supported by one of the enterprise ORDBMS (e.g. Oracle Spatial).
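The idea of holding the shape and the attributes in the same row can be sketched with Python's sqlite3 and struct modules. This is an illustration of the principle only: the binary encoding used here (a point count followed by x,y pairs) is a made-up format, not ESRI's actual geometry storage, and the table and column names are hypothetical.

```python
import sqlite3
import struct


def pack_points(points):
    """Encode a list of (x, y) pairs as a BLOB: a count, then the coordinates."""
    flat = [coord for point in points for coord in point]
    return struct.pack(f"<I{len(flat)}d", len(points), *flat)


def unpack_points(blob):
    """Decode a BLOB written by pack_points back into (x, y) pairs."""
    (count,) = struct.unpack_from("<I", blob, 0)
    flat = struct.unpack_from(f"<{2 * count}d", blob, 4)
    return list(zip(flat[0::2], flat[1::2]))


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE road (
        object_id INTEGER PRIMARY KEY,  -- unique feature ID
        name      TEXT,                 -- ordinary attribute column
        shape     BLOB                  -- locational details
    )
    """
)

line = [(0.0, 0.0), (10.0, 5.0), (20.0, 5.0)]
conn.execute("INSERT INTO road VALUES (?, ?, ?)", (1, "Road A", pack_points(line)))

name, shape = conn.execute(
    "SELECT name, shape FROM road WHERE object_id = 1"
).fetchone()
restored = unpack_points(shape)
print(name, restored)
```

Because the whole coordinate list sits in a single cell, the vertex order is preserved and no join is needed to bring the shape and its attributes together.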
There are three types of geodatabase:
1. Personal geodatabases. Personal geodatabases are single-user databases managed by the Microsoft Jet Engine
(i.e. Access) which is built into ArcGIS. They can handle small to medium sized datasets, but they are currently
restricted to a maximum size of 2 GB. They do not support versioning or topology. They can only be used on a
Windows platform.
2. File geodatabases. Also single-user databases, but stored in a file folder. There is no limit to total capacity, but
each dataset is limited to 1 TB of data. They do not support versioning or topology. They are not restricted to Windows
platforms. ESRI recommends them in preference to personal geodatabases.
3. ArcSDE geodatabases. These are for multiuser / enterprise use. They require ArcSDE running on a server. ArcSDE
provides an interface between ArcGIS and a standard enterprise DBMS such as Oracle, Informix, IBM DB2,
Microsoft SQL Server or PostgreSQL. Each of these supports multi-user editing and versioned workflows, and can
handle very large datasets. ArcView 9 (i.e. the entry-level version of ArcGIS) can use, but not edit, multiuser
databases as long as it has ArcSDE; to edit multiuser databases you require either ArcEditor or ArcInfo.
ArcGIS applications may be thought of as forming a separate tier sitting on top of the DBMS. ArcGIS applications
interact with the generic GIS object model for geodatabases and not with the actual DBMS instance. The DBMS
looks after disc-based storage, definition of attribute types, associative query processing, and multiuser transaction
processing, whereas the ArcGIS application looks after defining the specific DBMS schema used to represent
geographic datasets and domain specific logic. Although the data are stored in relational tables, they are treated in
the application tier as objects. Each feature class is stored in a table, within which each row represents an object (or
class instance) and each column contains an attribute.
ArcGIS implements advanced logic and behaviour in the application tier above the DBMS. This allows you to define
more intelligent features. Features in a georelational model are defined as geometric primitives (i.e. points, lines or
polygons), but in a geodatabase you can define your own feature classes which inherit the properties of generic
features, but which can have additional properties and specific behaviours of their own (including topology). For
example, you could specify that rivers may only flow downhill, that roads must join up with other roads, or that
houses must not straddle property parcels. You can also define rules (i.e. methods) governing the display of the
objects. For example, you might specify that the height labels on contour lines should only be located on the straight
sections of the contours and must be at least 10 cm apart, or that rivers should be modelled using Bézier spline
curves rather than straight line segments. Features can be joined to attribute data in the georelational model, but in
the geodatabase model they can be associated with object classes with defined behaviours (i.e. objects related to
geographical features, but which are not geographical features themselves e.g. property owners). The logical model
in a geodatabase therefore more closely resembles the conceptual model.
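The notion of a custom feature class with its own behaviour can be sketched in Python as follows. This is a conceptual illustration only: in ArcGIS such rules are defined within the geodatabase rather than written like this, and the class names, the rectangle-based geometry and the rule itself are simplifying assumptions. The rule implemented is the example given above, that a house must not straddle property parcels.

```python
class Feature:
    """Generic feature, with its geometry reduced to a bounding rectangle."""

    def __init__(self, name, xmin, ymin, xmax, ymax):
        self.name = name
        self.xmin, self.ymin = xmin, ymin
        self.xmax, self.ymax = xmax, ymax

    def contains(self, other):
        """True if the other feature lies wholly inside this one."""
        return (
            self.xmin <= other.xmin and other.xmax <= self.xmax
            and self.ymin <= other.ymin and other.ymax <= self.ymax
        )


class Parcel(Feature):
    """A property parcel: inherits everything from the generic feature."""


class House(Feature):
    """A house: a generic feature plus a behaviour rule of its own."""

    def is_valid(self, parcels):
        """Rule: the house must lie wholly inside one of the parcels."""
        return any(parcel.contains(self) for parcel in parcels)


parcels = [Parcel("P1", 0, 0, 10, 10), Parcel("P2", 10, 0, 20, 10)]
good_house = House("H1", 2, 2, 6, 6)
bad_house = House("H2", 8, 2, 12, 6)  # crosses the boundary between P1 and P2

print(good_house.is_valid(parcels), bad_house.is_valid(parcels))
```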
The data in a geodatabase can be accessed at three levels:
1. Through the geodatabase data access objects (a subset of ArcObjects). At this level the full structure of the
database is revealed. You can program the objects using a COM compliant IDE such as Visual Basic for
Applications or Visual C++.
2. The spatial data may be accessed as simple features (i.e. without topology) through ArcSDE using C or Java.
3. The non-spatial data can be accessed via the SQL interfaces of other applications and DBMS such as Access.
READINGS
Burrough, P. and McDonnell, R.A. (2004) Principles Of Geographical Information Systems. 2nd ed. Oxford
University Press, Oxford. [ 910.285 ]
Jones, C. (1997) Geographical Information Systems And Computer Cartography. Longman, London. [ 910.285 ]
Longley, P. et al. (2011) Geographic Information Systems And Science. 3rd ed. Wiley, New York. Chapter 10.
Shekhar, S. and Vatsavai, R.R. (2007) Object-Oriented Database Management Systems, in Wilson, J. and
Fotheringham, S. (eds) The Handbook of Geographical Information Science, Blackwell-Wiley, Oxford.
Worboys, M. and Duckham, M. (2004) GIS. A Computing Perspective. 2nd ed. CRC Press. Chapter 2. [ 910 ]
