Figure 1
In the diagram above there are three model layers. The physical data layer is the data we hold in a big data system; it can contain different data types such as video, audio, logs, business tables, and so on. The data modeling layer is the abstract data model we build to manage the physical data. The computing modeling layer is the application layer we build to retrieve information for business value. With these three layers, we build data models that separate physical data from data use: the application accesses data through the data model instead of accessing the physical data directly. This makes applications flexible and data manageable.
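As a sketch of this separation (the class and method names here are illustrative, not from the paper), an application can depend only on an abstract data-model interface while the physical layer varies underneath:

```python
from abc import ABC, abstractmethod

class DataModel(ABC):
    """Data modeling layer: the abstraction applications program against."""
    @abstractmethod
    def read(self, key: str) -> str: ...

class FileStore(DataModel):
    """Physical data layer: one possible backing store (an in-memory stand-in here)."""
    def __init__(self, records: dict):
        self._records = records
    def read(self, key: str) -> str:
        return self._records[key]

def compute_report(model: DataModel, key: str) -> str:
    """Computing modeling layer: sees only the data model, never the physical data."""
    return f"report: {model.read(key)}"

print(compute_report(FileStore({"order-1": "42 units"}), "order-1"))  # report: 42 units
```

Swapping `FileStore` for a different backing store leaves `compute_report` untouched, which is the flexibility the layering buys.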
To construct a big data model, we must first create data blocks based on data storage, data type, relationships, read/write requirements, and so on. Further, we must have modeling applications maintain these models, so that the data models always display and store the latest data.
Hybrid Data Modeling
Though NoSQL databases evolved to resolve specific issues of managing big data, there are good reasons for SQL's enduring popularity, and the current trend is to provide both SQL and NoSQL features in the same database. Some traditional RDBMS vendors, such as Microsoft and Teradata, provide NoSQL features in recent releases, such as the columnar storage support in Microsoft SQL Server 2012 and Teradata Warehouse 14. From the NoSQL side, projects like Hive, a Hadoop subproject, and the CQL query language of Cassandra have grown quickly to provide an SQL-like façade over a NoSQL database. Google, a leader in big data, is also developing a hybrid distributed database, Megastore, that can run SQL-specific features on a NoSQL storage layer. That's not surprising: both the quantity and the variety of data are growing rapidly, and we need many models and tools to process it. In the past we tried to cram data of all types into an RDBMS; now we must keep some of it out.
As big data drives changes in the collection and processing of data, we need to reconsider existing data and storage models to allow for increasingly important implementation issues, such as the performance penalty imposed by join operations and the pressure on storage space as data grows beyond the capacity of hardware storage. There are new options to resolve these issues: we can elect to migrate big data into a NoSQL database, possibly running a hybrid system as a tradeoff, taking advantage of the low risks and costs of these promising new options.
One approach is to track these changes and generate a code template to build a bridge between the current RDBMS and a target NoSQL system. This assists with data migration from an RDBMS to a NoSQL database, and gives a whole picture of the data deployed across various data systems.
The following diagram illustrates data migration design:
Figure 2
Data Migration
NoSQL Data:
Data with the characteristics shown in the diagram is well-suited for a NoSQL system.
RDBMS Data:
Data with the characteristics shown in the diagram might be better suited for a traditional RDBMS.
Data Consumption
It often takes more effort to migrate data consumers to storage in a new system than to migrate the data itself. Once we have used a hybrid model to migrate data, applications designed to access the data through relational methodology must be redesigned to adapt to the new data architecture. Hopefully, future hybrid databases will provide built-in migration tools to help leverage the potential of NoSQL. For now, remodeling data is critical before an existing system is migrated to NoSQL. In addition, modeling tools must be evaluated to verify that they meet the new requirements.
Physical Data-aware Modeling
For the past 30 years, "structure first, collect later" has been the data modeling approach: we determine the data structure before any data goes into the data system. In other words, the data structure definition is determined by how the data will be used. With big data, this traditional approach is no longer applicable, as the data doesn't necessarily have fixed schemas or formats. As a result, we need a new modeling methodology to suit big data's characteristics and deployment.
I propose a new approach for managing the variety of big data.
Dynamic Schema Definition
Even with big unstructured or semi-structured data, we still need to define schemas, because data relationships can be more complex than before, and hiding data-relationship logic in a program is not a good way to manage that complexity. Because big data follows a "structure later" approach, in most cases we can only know the data schema after the data has been created.
The proposed methodology is to define the schema on existing data after it has been collected and stored, and to use the schema to retrieve data at runtime while processing. In the definition, the schema function should be as atomic and isolated as possible. To achieve this, we must determine the scope of the data the schema function applies to, and the versions of data it can work with. This is called a dynamic schema, to distinguish it from a traditional fixed schema defined before data is collected.
The following is an example of a schema function showing how to fetch
BirthDate and PostCode from an identity number.
Figure 3
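A minimal sketch of such a schema function follows. The 18-digit identity-number layout assumed here, a 6-digit post/region code followed by an 8-digit YYYYMMDD birth date, is hypothetical and only for illustration:

```python
def identity_schema(identity_number: str) -> dict:
    """Dynamic schema function: applied to raw data at read time, not at load time.

    Assumed (hypothetical) layout: digits 0-5 are a post code,
    digits 6-13 are the birth date as YYYYMMDD.
    """
    return {
        "PostCode": identity_number[0:6],
        "BirthDate": identity_number[6:14],
    }

# The schema is applied when the data is retrieved, long after it was stored.
record = identity_schema("110101198709152345")
print(record["BirthDate"])  # 19870915
```

Because the function is atomic and isolated, it can be versioned and scoped to exactly the range of stored data whose layout it understands.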
Data Block
A data block is the physical data defined with metadata. We must know the
actual location of data, so we can model it on a modeling layer, and so the
computing program knows where to retrieve it.
The metadata definition must include the fields needed to locate and describe each block. As a general rule, to manage similar types of data, place data of identical format together; this makes the Mapper and Reducer pair more independent of the physical data, so that we can better reuse these computing units to accelerate development and reduce cost.
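A data-block metadata record might be sketched as follows; the field names are illustrative assumptions, not the paper's actual field list:

```python
from dataclasses import dataclass

@dataclass
class DataBlockMeta:
    """Illustrative metadata for a data block (field names are assumptions)."""
    location: str        # actual location of the physical data
    data_format: str     # e.g. "log", "csv", "avro"
    schema_version: int  # which dynamic schema version applies to this block

block = DataBlockMeta(location="hdfs:///logs/2012/07/",
                      data_format="log",
                      schema_version=1)
print(block.data_format)  # log
```

Grouping blocks by `data_format` is what lets one Mapper/Reducer pair be reused across every block of that format.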
Computing Flow
As the following diagram shows, rather than writing bigger Mappers and Reducers, we can combine existing Mappers and Reducers in a workflow: the output of the first Map-Reduce job is the input of the second. The whole workflow can be visualized and managed based on the data model introduced above.
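This chaining can be sketched in plain Python with a toy stand-in for a real Map-Reduce engine (the function names and sample data are illustrative):

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy Map-Reduce: map each record, shuffle by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in groups.items()]

# First job: word count over raw lines.
counts = map_reduce(
    ["a b a", "b c"],
    lambda line: [(word, 1) for word in line.split()],
    lambda word, ones: (word, sum(ones)),
)
# Second job: the first job's output is the second job's input
# (how many words occur exactly N times).
freq_of_freq = map_reduce(
    counts,
    lambda pair: [(pair[1], 1)],
    lambda count, ones: (count, sum(ones)),
)
print(dict(counts))        # {'a': 2, 'b': 2, 'c': 1}
print(dict(freq_of_freq))  # {2: 2, 1: 1}
```

Each job stays small and reusable; the workflow, not the individual Mapper or Reducer, expresses the larger computation.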
Data computing modeling helps manage the computing algorithms applied to data stored in raw format. When new tools are developed, the data model will assist in taking the computing task forward and help monitor the progress of data processing. Meanwhile, we can embed the data model into a big data system like Hadoop: once data is moved or imported, Hadoop can synchronize with the data model first.
Integration with Hive
Figure 4
Conclusion
The complexity of a relational database limits the scalability of data storage, but makes it very easy to query data through an SQL engine. Big data NoSQL systems have the opposite characteristics: unlimited scalability with more limited query capabilities. The challenge of big data is querying data easily. Creating data models over the physical data and the computing path helps manage raw data. The future will bring more hybrid systems combining the attributes of both approaches. Meanwhile, the dynamic schema model proposed in this paper, and systems such as Hive, offer help with the challenge of managing big data.
References
NoSQL Data Modeling Techniques http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/
Opetarational Data, Hadoop and New Modeling
http://erwin.com/expert_blogs/detail/opetarational_data_hadoop_and_new_modeling/
Big data http://en.wikipedia.org/wiki/Big_data
NOTICES
Copyright 2012 CA. All rights reserved. All trademarks, trade names, service marks and logos referenced herein belong to their respective companies.