December 2013
By Richard Bellamy
ABSTRACT: The proliferation of NoSQL data management tools has resulted in a large number of
products without a well-documented framework in which to classify them. The proponents of
open-source data management systems often lack an incentive to produce such documentation for
technical novices, because they typically derive their revenue from supporting the users of their systems.
Although journalists can provide a more objective source of information, their target audience is
generally working professionals. In this paper, I provide a vendor-agnostic introduction to the types of
data management technologies and the qualities that each can have. This paper is targeted at students
of data management who are looking for a framework in which to structure their study of data
management technologies.
To state the obvious, a path by which data is accessed is an access path. The less obvious insight here is
that the convergence point is data, and the machine representation is also data. Therefore, there is an
access path from each logical data structure to the data item and another one from the data item to its
machine representation. It is also correct to refer to the whole path from a logical data structure to the
data items machine representation as an access path. The failure to distinguish between these access
paths and identify to which access path one was referring has been a recurring source of communication
failure. One such occasion was a computer science symposium in 1971 at which much of the foundation
for the traditional relational database management system was developed.
A system that receives requests for information from application programs and maps those requests to
the data item's convergence point is a database management system. Once the access paths have
converged on a single data item, that data item could theoretically be stored in a large, single-value
store. The access path, at that point, becomes dependent on the hierarchy of the physical storage
structure. For this reason, the problems of mapping from data item to machine representation have
been called both concrete problems and hierarchy-dependent problems.
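The two-stage mapping described above can be sketched in a few lines. The following is a toy illustration, not the design of any particular system; every name in it (the key format, the block/offset pair) is hypothetical:

```python
# Toy sketch of a database management system mapping a logical
# request first to a data item (the convergence point) and then
# to its machine representation. All structures are hypothetical.

# Logical layer: the application asks for a field by name.
logical_request = ("customer", 42, "balance")

def to_data_item_key(entity, ident, field):
    """Map a logical request onto the data item's convergence point."""
    return f"{entity}:{ident}:{field}"

# Physical layer: the data item's machine representation, here a
# (block, offset) location in a simulated storage hierarchy.
physical_index = {"customer:42:balance": (7, 128)}

key = to_data_item_key(*logical_request)   # logical -> data item
block, offset = physical_index[key]        # data item -> machine representation
print(key, block, offset)                  # customer:42:balance 7 128
```

The first mapping is the access path from the logical data structure to the data item; the dictionary lookup stands in for the concrete, hierarchy-dependent problem of locating the machine representation.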
It has traditionally been the norm for database management systems to function as a black box from the
perspective of the application program, and to internally manage the mapping from the point of
convergence to the data store. This was previously necessary to achieve adequate performance.
The concept of a large, single-value store was only a theoretical tool when data management theory was
first being developed. At the time, machine representations were typically stored in file systems, and
large single-value stores were thought to be unrealistic. They have become common, but are now called
key-value stores. The term data store properly refers to systems that solve concrete problems, such as
file systems and key-value stores, although it is also used by some database management systems that
prefer to market themselves as data stores.
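The essence of a key-value store, as the term is used above, fits in a few lines: opaque keys map to opaque values, and the store addresses only the concrete problem of holding and retrieving data. This minimal sketch uses an in-memory dictionary in place of real persistence:

```python
# Minimal key-value store sketch. Keys and values are opaque bytes;
# the store imposes no logical data structure on its contents.
class KeyValueStore:
    def __init__(self):
        self._data = {}  # stands in for durable storage

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: bytes) -> bytes:
        return self._data[key]

store = KeyValueStore()
store.put(b"user:1", b'{"name": "Ada"}')
print(store.get(b"user:1"))
```

Any interpretation of the stored bytes (as JSON, as a serialized record, and so on) is left entirely to the application, which is what distinguishes this from a database management system.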
1.4 RELATIONS
While hierarchical data provides data independence in theory, and today provides it in practice as well,
the performance of IMS was not initially sufficient in many cases to allow the data to be queried for
report generation except in the manner in which the data was stored. A retailer wishing to analyze past
transactions would have a hard time discovering relationships in the data retrospectively. Data
independence would need to be accomplished by some other means until hardware progressed.
The data industry settled on the use of relations to overcome the performance bottleneck, and
developed SQL to facilitate description of the characteristics of the data that would be most useful for
optimizing query performance. Relational database management systems became synonymous with
data independence, SQL, and the routing of all data through query processors early enough that many
people were years into their careers before they needed to distinguish between these four separate
characteristics of data management technologies.
Because relations are often presented to users in the form of tables, the misunderstanding that
relational database management systems are only capable of row-oriented tabular data structures
became very widespread. Relational systems are actually much more flexible, and can store a table of
data in a column-oriented format if the database administrator so desires.
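The row/column distinction can be made concrete with one small relation stored both ways. This is an illustration of the two physical layouts, not of any particular system's storage engine; the table contents are invented:

```python
# The same logical relation in two physical layouts. Row orientation
# keeps each tuple's values together; column orientation keeps each
# attribute's values together. Both answer identical relational queries.
rows = [
    {"id": 1, "city": "Oslo",  "sales": 120},
    {"id": 2, "city": "Lima",  "sales": 340},
    {"id": 3, "city": "Cairo", "sales": 205},
]

# Column-oriented layout of the same relation.
columns = {
    "id":    [r["id"] for r in rows],
    "city":  [r["city"] for r in rows],
    "sales": [r["sales"] for r in rows],
}

# An aggregate over one attribute scans a single contiguous list in
# the column layout, which is why analytic workloads often favor it.
total_row_oriented = sum(r["sales"] for r in rows)
total_col_oriented = sum(columns["sales"])
assert total_row_oriented == total_col_oriented == 665
```

The point is that the choice of layout is a physical concern: the relation, and any query over it, is unchanged either way.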
The distinction between relations and table rows became more significant with the advent of NoSQL
technology because relations were not necessary to achieve data independence, only to achieve
adequate performance with the hardware that was then considered to be state-of-the-art. SQL
facilitates description of the logic through which data will be used in order to optimize performance of
the query processor. Data independence results from the data management system's separation of
each application's logical form for a given set of related data items from the others' and from the
physical data structure.
When Google published its Bigtable whitepaper in 2006, it introduced many data professionals to a
NoSQL database management system that actually organizes data items into a table-like structure.
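Bigtable's table-like structure is, per the 2006 paper, a sparse, sorted map from (row key, column, timestamp) to value. A nested dictionary sketches the shape of that model; the row key and column name below are hypothetical examples, and this is an illustration of the data model only, not of Bigtable's implementation:

```python
# Sketch of a Bigtable-style data model:
# (row key, column, timestamp) -> value, with multiple timestamped
# versions per cell. All keys shown are hypothetical.
table = {}

def put_cell(row_key, column, timestamp, value):
    table.setdefault(row_key, {}).setdefault(column, {})[timestamp] = value

def get_latest(row_key, column):
    versions = table[row_key][column]
    return versions[max(versions)]  # highest timestamp wins

put_cell("com.example/index", "contents:html", 3, "<html>v3</html>")
put_cell("com.example/index", "contents:html", 5, "<html>v5</html>")
print(get_latest("com.example/index", "contents:html"))  # <html>v5</html>
```

Unlike a relation, this structure is sparse by construction: a row simply omits any column it has no value for, and each cell can retain several timestamped versions.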
The CAP Theorem addresses situations in which transactions cannot all be processed sequentially within
applicable time constraints. It states that system design choices involve a tradeoff between
availability and the risk of temporary inconsistency. It was previously expressed as the need for
distributed systems to choose two of three qualities: consistency, availability, and partition
tolerance. The originator of the CAP Theorem has since recognized that the first formulation, as well as
its predecessor (BASE), failed to fully capture the intended message.
checks printed with magnetic ink. The use of batch processing meant that account balance data would
not reflect an update until a batch was processed after the financial transaction occurred. During each
batch, the processing unit would produce a list of delinquent or overdrawn accounts. At that point, a
bank manager would personally handle the exceptions. Modern ATMs function in a similar manner
when disconnected from the network, except that exception handling is now automated as well. Both
the check processing system in the 1960s and ATM systems today are eventually consistent.
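The batch pattern described above can be sketched directly: writes accumulate in a queue, the visible balance lags behind the true state until the batch runs, and the batch surfaces exceptions for handling afterward. All names below are illustrative, not drawn from any real banking system:

```python
# Sketch of eventually consistent batch processing. Between batches
# the visible balance is stale; the batch applies pending transactions
# and reports exceptions (overdrawn accounts) for later handling.
balances = {"acct-1": 100}
pending = []  # (account, delta) pairs awaiting the next batch

def record_transaction(account, delta):
    pending.append((account, delta))  # visible balance not yet updated

def run_batch():
    overdrawn = []
    while pending:
        account, delta = pending.pop(0)
        balances[account] += delta
        if balances[account] < 0:
            overdrawn.append(account)  # exceptions handled after the batch
    return overdrawn

record_transaction("acct-1", -150)
assert balances["acct-1"] == 100      # stale until the batch runs
assert run_batch() == ["acct-1"]      # batch surfaces the overdraft
assert balances["acct-1"] == -50      # state has converged
```

The window between `record_transaction` and `run_batch` is exactly the temporary inconsistency the CAP discussion above describes; the system converges once the batch completes.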
processing system. Since publishing the Dynamo paper, Amazon has also launched a commercial data
service called DynamoDB.
Facebook published a paper on Cassandra in 2009 after open-sourcing Cassandra the previous year.
Cassandra has become an Apache project that is known for its continually evolving data model and the
level of support provided by DataStax. Because the data model continues to evolve, Facebook's paper
on Cassandra is less useful as a point of reference.