
Guide to Data Management Technology

December 2013
By Richard Bellamy
ABSTRACT: The proliferation of NoSQL data management tools has resulted in a large number of
products without a well-documented framework in which to classify them. The proponents of
open-source-based data management systems often lack incentive to produce such documentation for
technical novices because they derive their revenue from supporting the users of their systems.
Although journalists can provide a more objective source of information, their target audience is
generally working professionals. In this paper, I provide a vendor-agnostic introduction to the types of
data management technologies and the qualities that each can have. This paper is targeted at students
of data management who are looking for a framework in which to structure their study of data
management technologies.
CONTENTS

1  Overview of data management technology
   1.1  Data independence
   1.2  Database management systems
   1.3  Hierarchical data
   1.4  Relations
   1.5  NoSQL data management systems
2  Qualities of data management systems
   2.1  Atomicity & durability
   2.2  Consistency distinguished from eventual consistency
   2.3  Conflict management, isolation & availability
3  Data management systems
   3.1  Published research
   3.2  Open source data management systems
   3.3  Open source query processors
4  Selected references by topic
   4.1  Data management technology
   4.2  Relational technology
   4.3  Qualities of data management systems
   4.4  Data management systems


1 Overview of data management technology


1.1 DATA INDEPENDENCE
The goal of a database management system is to provide data independence so that multiple
application programs can share a common base of data without all such application programs needing
to be revised when the physical data structure used for storing the data is altered.
Each separate piece of information, as understood by the application programs that have access to it, is
a data item. A database is a non-redundant collection of related data items: a common base of data
items that are shared across potentially multiple application programs, such as a payroll system and a
resource scheduling system. For instance, if an employee's name must be updated separately in both
systems, then each system is utilizing a separate base of data and the employee's name is stored
redundantly. If, however, both systems share the same base of data, such that an update to the
employee's name from either system would affect both systems equally, then there is one non-redundant base of data that is common across both systems.
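A toy sketch can make the distinction concrete. The following illustration (not from the paper; all names are hypothetical) shows two application programs sharing one non-redundant base of data, so that an update made through either program is visible to both:

```python
# Two application programs sharing one non-redundant base of data.
# Because both reference the same underlying data items, updating the
# employee's name through either program is visible to both.

shared_data = {"emp42": {"name": "R. Smith", "hours": 40, "salary": 50000}}

class PayrollApp:
    def __init__(self, data):
        self.data = data
    def rename(self, emp_id, name):
        self.data[emp_id]["name"] = name

class SchedulingApp:
    def __init__(self, data):
        self.data = data
    def name_of(self, emp_id):
        return self.data[emp_id]["name"]

payroll = PayrollApp(shared_data)
scheduling = SchedulingApp(shared_data)
payroll.rename("emp42", "R. Jones")
print(scheduling.name_of("emp42"))  # the update is visible to both programs
```

If each application instead held its own copy of the employee record, the rename would have to be applied twice, and the name would be stored redundantly.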

1.2 DATABASE MANAGEMENT SYSTEMS


To accomplish data independence, the logical data structure as each application program understands it
must be separated from the physical data structure in which the data is stored, a.k.a. the machine
representation of the data. This is a two-step process. The first step is to map the field-key
combination in each separate logical data structure to the same data item; the problems of performing
this mapping are the abstract problems. Proceeding from the point at which the logical data structures
converge, the second step is to retrieve the machine representation of the data item from its physical
storage location.

To state the obvious, a path by which data is accessed is an access path. The less obvious insight here is
that the convergence point is data, and the machine representation is also data. Therefore, there is an
access path from each logical data structure to the data item and another one from the data item to its
machine representation. It is also correct to refer to the whole path from a logical data structure to the
data item's machine representation as an access path. The failure to distinguish between these access
paths and identify to which access path one was referring has been a recurring source of communication
failure. One such occasion was a computer science symposium in 1971 at which much of the foundation
for the traditional relational database management system was developed.
A system that receives requests for information from application programs and maps those requests to
the data item's convergence point is a database management system. Once the access paths have
converged on a single data item, that data item could theoretically be stored in a large, single-value
store. The access path, at that point, becomes dependent on the hierarchy of the physical storage
structure. For this reason the problems of mapping from data item to machine representation have
been called both concrete problems and hierarchy-dependent problems.

It has traditionally been the norm for database management systems to function as a black box from the
perspective of the application program, and to internally manage the mapping from the point of
convergence to the data store. This was previously necessary to achieve adequate performance.
The concept of a large, single-value store was only a theoretical tool when data management theory was
first being developed. At the time, machine representations were typically stored in file systems, and
large single-value stores were thought to be unrealistic. They have become common, but are now called
key-value stores. The term data store properly refers to systems that solve concrete problems, such as
file systems and key-value stores, although the term is also adopted by some database management
systems that prefer to market themselves as data stores.
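The two-step mapping described above can be sketched in a few lines. In this illustrative model (all names are hypothetical), each application's logical (field, key) pair is first mapped to a shared data-item identifier (the abstract problem), and the identifier is then resolved to a machine representation held in a key-value store (the concrete problem):

```python
# Step 1: per-application logical structures converge on one item id
# (the abstract portion of the access path).
logical_maps = {
    "payroll":    {("employee_name", "42"): "item:name:42"},
    "scheduling": {("worker", "42"):        "item:name:42"},
}

# Step 2: a large, single-value (key-value) store holds the machine
# representation (the concrete portion of the access path).
key_value_store = {"item:name:42": b"R. Smith"}

def read(app, field, key):
    item_id = logical_maps[app][(field, key)]  # abstract access path
    return key_value_store[item_id]            # concrete access path

# Both applications' access paths converge on the same data item.
assert read("payroll", "employee_name", "42") == read("scheduling", "worker", "42")
```

Either half of the mapping can be changed independently: swapping the key-value store for another one does not disturb the logical maps, which is the essence of data independence.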

1.3 HIERARCHICAL DATA


Before relational database management systems and SQL came into existence, there were hierarchical
database management systems such as IMS. The "hierarchical" in "hierarchical database management
system" refers to the data model that the database management system presents to the application
programs, as opposed to being a reference to the access path of the data items. Although the logical
structure was a significant factor affecting physical storage, it wasn't the only one. IMS supported
serializing the data for storage on tape, as well as locality control through data set groups.
Today, hierarchical database management systems often use XML or JSON as a language for expressing
data and queries. These systems are also called document-oriented database management systems
because they represent each record as an XML or JSON document.
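A minimal, illustrative document store (field names hypothetical) shows what the document-oriented model looks like in practice: each record is held as a JSON document, and access follows the document's hierarchy:

```python
# A toy document store: each record is serialized as a JSON document,
# mirroring the hierarchical (document-oriented) model described above.
import json

documents = {}

def put(doc_id, record):
    documents[doc_id] = json.dumps(record)   # store the record as JSON

def get(doc_id):
    return json.loads(documents[doc_id])     # parse it back on retrieval

put("emp42", {"name": "R. Smith",
              "assignments": [{"project": "inventory", "hours": 12}]})
record = get("emp42")
print(record["assignments"][0]["project"])   # nested, hierarchical access
```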


1.4 RELATIONS
While hierarchical data provides data independence in theory, and today provides it in practice as well,
the performance of IMS was not initially sufficient in many cases to allow the data to be queried for
report generation except in the manner in which the data was stored. A retailer wishing to analyze past
transactions would have a hard time discovering relationships in the data retrospectively. Data
independence would need to be accomplished by some other means until hardware progressed.
The data industry settled on the use of relations to overcome the performance bottleneck, and
developed SQL to facilitate description of the characteristics of the data that would be most useful for
optimizing query performance. Relational database management systems became synonymous with
data independence, SQL, and the routing of all data through query processors early enough that many
people were years into their careers before they needed to distinguish between these four separate
characteristics of data management technologies.

1.5 NOSQL DATA MANAGEMENT SYSTEMS


Before NoSQL, the primary alternative to relational database management systems was the file system.
The space between file systems and query-access-only relational database systems (NoSQL data
management systems) includes alternative data stores such as key-value stores and alternative
database management systems such as document-oriented systems. The term NoSQL stands for "not
only SQL," although the meaning that the term was intended to capture is "not only query-based
access."
Now, even some relational database management systems are becoming NoSQL systems. They
accomplish this by using their internal system for the indices representing that abstract portion of the
access path, while allowing the internal data store to be swapped out for another one that also provides
direct access to the information. Likewise, SQL query processors have been developed to support data
stored in NoSQL data stores.

Because relations are often presented to users in the form of tables, the misunderstanding that
relational database management systems are only capable of row-oriented tabular data structures
became very widespread. Relational systems are actually much more flexible, and can store a table of
data in a column-oriented format if the database administrator so desires.


The distinction between relations and table rows became more significant with the advent of NoSQL
technology because relations were not necessary to achieve data independence, only to achieve
adequate performance with the hardware that was then considered to be state-of-the-art. SQL
facilitates description of the logic through which data will be used in order to optimize performance of
the query processor. Data independence results from the data management system's separation of
each application's logical form for a given set of related data items from each other's and from the
physical data structure.

When Google published its Bigtable whitepaper in 2006, it introduced many data professionals to a
NoSQL database management system that actually organizes data items into a table-like structure.

2 Qualities of data management systems


In data systems, the nature of an individual transaction is defined by reference to the four ACID
properties. These properties are atomicity, consistency, isolation, and durability. As the discussion
below illustrates, atomicity and durability ensure that the system behaves in a consistent manner.
Consistency, in this context, requires that the values stored in the database are in compliance with
constraints placed on them. Isolation is the traditional approach to conflict management. The modern
approach to designing how a system will process transactions is to specify a conflict management
strategy, for which isolation is one of several candidates.


CAP Theorem addresses situations wherein transactions cannot all be processed sequentially within
applicable time constraints. It states that system design choices will involve a tradeoff between
availability and the risk of temporary inconsistency. It was previously expressed as the need for
distributed systems to make a choice of two out of three qualities: consistency, availability, and partition
tolerance. The originator of the CAP Theorem has since recognized that the first formulation, as well as
its predecessor (BASE), failed to fully capture the intended message.

2.1 ATOMICITY & DURABILITY


Atomicity is the constraint that the transaction either successfully commits or must fail as a whole. The
significance of this varies with the storage media because some media can write the entire transaction
at once, while other media require bits to be written sequentially. Although atomicity prevents a
fraction of a transaction from occurring without the whole transaction occurring, a transaction as a
whole may still be said to either abort or commit. Durability is the constraint that once the transaction
has been said to commit, it cannot be undone.
Placed in historical context, the meanings of atomicity and durability can be more apparent. Early data
transactions were often stored on punched cards. Imagine a transaction conducted by an end-user
walking up to a window at which a member of the data processing staff punches the transaction into a
punched card. Atomicity stands for the proposition that either the task is completed and the new
punched card is processed in the next batch, or the task is interrupted and the punched card is never
finished. Durability stands for the proposition that whether the first transaction commits or aborts, the
end-user's only option for further affecting the database is to attempt a second transaction.
A compensating transaction is a second transaction to correct an error caused by the first transaction.
Whether the first transaction was committed in error or was wrongfully aborted, compensating
transactions allow durability to be preserved. However, compensating transactions are more practical
in some situations than in others. For example, a computer can automatically execute a compensating
transaction if it discovers an erroneous deduction from a bank account. If two customers were to be
assigned the same account number, it would be a more complicated mistake for which to compensate.
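The interplay of atomicity, durability, and compensating transactions can be sketched as follows. This is an illustrative toy, not a real transaction manager: a transaction either commits as a whole or aborts leaving no partial effect, the committed log is append-only (durability), and an erroneous committed deduction is corrected by a second, compensating transaction:

```python
# Toy transaction processor illustrating atomicity, durability, and a
# compensating transaction. Account names and amounts are hypothetical.

committed_log = []          # durable: entries are only ever appended
balances = {"acct1": 100}

def run(txn_id, updates):
    snapshot = dict(balances)
    try:
        for acct, delta in updates:
            if balances[acct] + delta < 0:
                raise ValueError("overdraft")
            balances[acct] += delta
    except Exception:
        balances.clear()
        balances.update(snapshot)             # abort: no partial effect remains
        return "abort"
    committed_log.append((txn_id, updates))   # commit: the whole transaction
    return "commit"

run("t1", [("acct1", -30)])                   # commits; suppose it was erroneous
run("t2", [("acct1", -30), ("acct1", -90)])   # aborts as a whole (overdraft)
run("t3", [("acct1", +30)])                   # compensating transaction for t1
print(balances["acct1"])
```

Note that t1 is never undone; durability is preserved because the correction arrives as a new transaction (t3) rather than a retroactive change to the log.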

2.2 CONSISTENCY DISTINGUISHED FROM EVENTUAL CONSISTENCY


In the context of ACID, a database is said to be consistent when the constraints on the database are
satisfied. These constraints include requirements such as: where a field in one table is a foreign
key, there must be a record with a matching primary key. Because ACID applies to the resolution of abstract
problems, single-copy consistency and other matters of physical structure have no meaning in this
context.
Eventual consistency is the requirement that once a transaction is committed by one node, it must
eventually be disseminated to all nodes storing data items that were affected by the transaction.
Starting in the 1990s when BASE popularized the notion of eventual consistency, the word consistent on
its own became ambiguous. Although BASE as an acronym became obsolete with the introduction of
CAP Theorem, eventual consistency remains a popular term.
Although the popularity of the term is relatively recent, data processing systems have traditionally been
designed for eventual consistency. By 1960, banks were using automated data processing systems for
checks printed with magnetic ink. The use of batch processing meant that account balance data would
not reflect an update until a batch was processed after the financial transaction occurred. During each
batch, the processing unit would produce a list of delinquent or overdrawn accounts. At that point, a
bank manager would personally handle the exceptions. Modern ATMs function in a similar manner
when disconnected from the network, except that exception handling is now automated as well. Both
the check processing system in the 1960s and ATM systems today are eventually consistent.
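The batch-processing pattern described above can be sketched in a few lines. In this toy illustration (account names and amounts hypothetical), updates are queued as they occur, the visible balance does not reflect them until the next batch runs, and the batch produces an exception list of overdrawn accounts:

```python
# Toy batch processor illustrating eventual consistency: the balance
# lags behind recorded transactions until the batch is processed, at
# which point exceptions (overdrawn accounts) are flagged for handling.

balances = {"acct1": 50}
pending = []                       # transactions awaiting the next batch
overdrawn = []                     # the batch's exception list

def record(acct, delta):
    pending.append((acct, delta))  # the visible balance is not yet updated

def process_batch():
    while pending:
        acct, delta = pending.pop(0)
        balances[acct] += delta
    for acct, bal in balances.items():
        if bal < 0:
            overdrawn.append(acct)  # handled by a person (or, now, a program)

record("acct1", -80)
print(balances["acct1"])            # still 50: not yet consistent with the update
process_batch()
print(balances["acct1"], overdrawn) # now -30, and acct1 is flagged
```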

2.3 CONFLICT MANAGEMENT, ISOLATION & AVAILABILITY


When more than one user has access to a common base of data at the same time, the data
management system needs a way to resolve conflicts should they occur, such as when multiple
application programs attempt to update the same data item at the same time. The sequence of the
updates affects the outcome.
Database management systems have traditionally isolated themselves from conflict, thereby forcing
conflict to be resolved elsewhere. Isolation protects the data technology from blame when a conflict
occurs. However, as Amazon demonstrated with Dynamo, this risk-aversion eventually became a
binding constraint on the data industry. Because conflict resolution can now be automated, data
management systems are now more likely to offer alternatives to isolation.
Escaping the bottleneck created by isolation is one strategy for achieving high availability. High
availability systems prioritize response time over avoidance of inconsistent states. Amazon's Dynamo
accomplishes this by empowering every single node to handle any task that exists in the system, even if
it cannot connect to any other node. Amazon handles shopping carts in this manner on the theory that
it is better to collect the data now and compensate the customer later than to reject data and possibly
cause the customer to shop elsewhere. In doing so, Dynamo sacrifices isolation.
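The shopping-cart example can be sketched as follows. This is an illustrative toy, not Dynamo's actual mechanism: two nodes accept concurrent writes while unable to reach each other (high availability), and the conflict is resolved at read time by merging the divergent carts rather than isolating the writers. The merge rule here (set union) is an assumption for illustration; real systems choose a domain-specific merge:

```python
# Dynamo-style conflict handling for a shopping cart, in miniature.
# Both replicas start from the same cart, accept concurrent updates
# while partitioned, and are merged at read time so neither update
# is lost.

cart_on_node_a = {"book"}
cart_on_node_b = {"book"}

cart_on_node_a.add("lamp")   # concurrent updates while the nodes
cart_on_node_b.add("mug")    # cannot connect to each other

# Read-time resolution: merge the divergent versions (set union).
merged = cart_on_node_a | cart_on_node_b
print(sorted(merged))  # ['book', 'lamp', 'mug']
```

The customer may briefly see an inconsistent cart, but no addition is rejected, which is exactly the trade the CAP discussion above describes: availability at the price of temporary inconsistency.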

3 Data management systems


3.1 PUBLISHED RESEARCH
System R was an experimental relational database management system developed by IBM and installed
at several user sites for evaluation and testing. The commercial product derived from System R, called
SQL/Data System, was announced in 1981.
Bigtable is Google's distributed system for managing structured data that implements a simpler data
model than relational database management systems. Bigtable does so for the same purpose for which
the relational model was developed: data independence that remains available at a scale greater than
previous technology allowed. Whereas relational systems optimize storage locations and update indices
at write time, Bigtable uses immutable values with timestamps to increase scalability through NoSQL
data access.
Dynamo is Amazon's distributed key-value store. Dynamo was designed to maximize availability while
minimizing latency. Dynamo uses vector clocks for conflict resolution at read time, instead of using
isolation. Because conflicts are resolved before orders are placed, this eventually consistent system
does not interfere with the use of an isolation-based conflict management strategy for the order
processing system. Since publishing the Dynamo paper, Amazon has also launched a commercial data
service called DynamoDB.
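A minimal vector-clock sketch (illustrative, not Dynamo's actual code) shows how conflicts are detected at read time. Each version of a value carries a clock mapping node to event count; one clock descends from another when it is at least as large in every component, and two versions that do not descend from each other are concurrent and must be reconciled:

```python
# Vector clocks in miniature: detect whether one version of a value
# supersedes another or whether the two are concurrent (a conflict to
# be reconciled at read time). Node names are hypothetical.

def descends(a, b):
    """True if clock a includes every event recorded in clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def relation(a, b):
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a supersedes b"
    if descends(b, a):
        return "b supersedes a"
    return "concurrent"  # neither descends from the other: reconcile

print(relation({"nodeA": 2, "nodeB": 1}, {"nodeA": 1, "nodeB": 1}))  # a supersedes b
print(relation({"nodeA": 2},             {"nodeB": 1}))              # concurrent
```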
Facebook published a paper on Cassandra in 2009 after open-sourcing Cassandra the previous year.
Cassandra has become an Apache project that is known for its continually evolving data model and the
level of support provided by DataStax. Because the data model continues to evolve, Facebook's paper
on Cassandra is less useful as a point of reference.

3.2 OPEN SOURCE DATA MANAGEMENT SYSTEMS


MariaDB is a relational database management system with NoSQL support. MariaDB started as a fork of
MySQL, which itself is the open source variant of Oracle's core relational database management system.
MariaDB has been actively integrating newer technologies such as NoSQL access to data items, while
preserving its ability to be used as a drop-in replacement for MySQL.
HBase is the column-oriented Hadoop database that was modeled after Bigtable. Similar to Bigtable,
it uses a three-dimensional table-like data model, with columns and rows followed by timestamps. The
value for each column-row-timestamp key tuple is immutable once written, enabling HBase to leverage
the existing data store functionality in HDFS for concrete machine representation issues.
HBase and Bigtable both use column families as the basic unit of access control and accounting. HBase
also uses column families for compression and locality control, while Bigtable uses locality groups
containing one or more column families as the basic unit of control over compression and locality.
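The three-dimensional table described above can be modeled in a few lines. This is an illustrative sketch of the data model only (row and column names hypothetical), not of HBase's implementation: values are keyed by (row, column, timestamp), cells are never overwritten, and a read returns the value with the latest timestamp:

```python
# Toy model of the HBase/Bigtable data model: a table keyed by
# (row, column, timestamp) whose cells are immutable once written.

table = {}  # (row, column, timestamp) -> value

def put(row, column, timestamp, value):
    key = (row, column, timestamp)
    assert key not in table, "cells are immutable once written"
    table[key] = value

def get_latest(row, column):
    versions = {ts: v for (r, c, ts), v in table.items()
                if r == row and c == column}
    return versions[max(versions)]   # newest timestamp wins

put("com.example", "contents:html", 1, "<html>v1</html>")
put("com.example", "contents:html", 2, "<html>v2</html>")
print(get_latest("com.example", "contents:html"))
```

Because old versions are never modified, the underlying data store only ever needs to append, which is why immutable storage such as HDFS suffices for the concrete machine-representation problem.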
Riak is a distributed key-value store designed for high availability and partition tolerance. Modeled after
Dynamo, it adds distributed, full-text search as well as the ability to add tags to key-value pairs through
a secondary index. Riak CS is a cloud storage application built on top of Riak that provides (Amazon) S3
API compatibility. Riak should not be confused with Redis, an in-memory key-value store that supports
associating each key with an arbitrary string or multiple arbitrary strings in the form of a (linked) list, set
or sorted set of arbitrary strings.

3.3 OPEN SOURCE QUERY PROCESSORS


Apache Hive is a tool for projecting structure onto large, distributed data sets. Users can query the data
through that projected structure using Hive's own SQL-like query language, HiveQL. Hive was originally
developed at Facebook to enable people with strong SQL skills but little or no Java experience to query
data for analytic use cases; it is not designed for real-time queries or OLTP workloads. HCatalog
(formerly the Hive metastore) is the Hive component that manages the table and storage metadata (the
indices used for solving abstract problems).
Presto is a distributed SQL query engine for interactive analytic queries against a variety of data sources.
Presto supports Hive, HBase, relational database management systems and proprietary data stores.
When used with a Hive data source, it combines the metadata in HCatalog with its own query execution
engine. Like Cassandra and Hive, Presto was developed for internal use at Facebook before being open
sourced. In contrast to Cassandra and Hive, Facebook has retained control over the open source
development of Presto.


4 Selected references by topic


4.1 DATA MANAGEMENT TECHNOLOGY
Jacob T. Schwartz, Abstract and Concrete Problems in the Theory of Files, in Courant Computer Science
Symposium 6, Data Base Systems 1 (Randall Rustin ed. 1972).
M. E. Senko, Details of a Scientific Approach to Information Systems, in Courant Computer Science
Symposium 6, Data Base Systems 143 (Randall Rustin ed. 1972).

4.2 RELATIONAL TECHNOLOGY


E.F. Codd, A Relational Model of Data for Large Shared Data Banks, 13 Communications of the ACM 377,
377 (June 1970).
Donald D. Chamberlin, A.M. Gilbert & Robert A. Yost, A History of System R and SQL/Data System, 7 Very
Large Data Bases 456 (1981).

4.3 QUALITIES OF DATA MANAGEMENT SYSTEMS


I.B.M., General Information Manual: Introduction to IBM Data Processing Systems (1960).
Jim Gray, The Transaction Concept: Virtues and Limitations, 7 Very Large Data Bases 144 (1981).
Eric Brewer, CAP Twelve Years Later: How the Rules Have Changed, Computer, February 2012 at 23.
Jonathan Ellis, Lightweight transactions in Cassandra 2.0, DataStax Developer Blog (July 23, 2013)
http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0.

4.4 DATA MANAGEMENT SYSTEMS


Fay Chang et al., Bigtable: A Distributed Storage System for Structured Data, 7 Operating System Design
and Implementation 205 (2006).
Giuseppe DeCandia et al., Dynamo: Amazon's Highly Available Key-value Store, 2007 Operating System
Principles 205.
Ashish Thusoo et al., Hive - A Warehousing Solution Over a Map-Reduce Framework, 2 Proceedings of
the Very Large Data Bases Endowment (2009).
