Warehouse
An Ovum white paper for Cloudera
SUMMARY
Catalyst
Surging data volumes are stressing traditional enterprise data warehouse (DW) and business
intelligence (BI) architectures. As data volumes and sources proliferate, IT organizations are
seeking new flexibility for running their data warehousing environments to reduce cost; increase
availability for query, analytics, and reporting; and provide flexibility for tapping new or additional
data sources for intelligence. While existing relational data warehouse architectures have served
enterprises well, organizations require more flexibility to meet their changing needs, even as they
leverage the tools, skillsets, and practices that they have already established.
Ovum view
Hadoop is steadily maturing into an enterprise data platform that is adding the SQL compatibility,
interactivity, manageability, and security that IT organizations expect. A versatile, economical
platform for scalable data processing, Hadoop can help organizations better meet their SLAs while
accommodating a wide range of analytic, exploration, query, and transformation workloads.
Hadoop is evolving from its batch processing origins into a flexible, economical hub where
organizations store original data, keep archival data active, and grow their options for data
exploration, modeling, analysis, and reporting. Hadoop will complement, not replace, your data
warehouse and BI infrastructure, providing new flexibility for generating insights as your business
requirements change.
Key messages
Hadoop offers an economical platform to offload data transformation cycles, and can
potentially simplify and reduce errors in the process
Hadoop is rapidly adding the SQL compatibility that businesses need and the
hardening that IT organizations expect
Although Hadoop will not replace your enterprise data warehouse, the time to start
planning for Hadoop to assume a greater proportion of your enterprise DW workload
is now
THE BI BOTTLENECK
Figure 1. Traditional multi-tier BI/Data warehousing architecture
Source: Ovum
Regulatory mandates compel them to retain the data online, forcing them to budget
for continually spiraling licensing and storage costs.
Regulatory mandates dictate that data be retained, but not necessarily live. The
typical choice is to move the data to archive, where it is accessible only for
restoration processes and therefore cannot be used for analytics. For budgeting
purposes, the result is sunk costs with negligible economic return.
Yet, what if your organization could make that archive active, and do so at a small fraction of the
cost of a data warehouse? Analyzing customer behavior over 7-10 year periods, as opposed
to the 1-3 years typically available, could yield valuable insights such as:
What unexpected insights might arise after mining 10 years of data and associating it
with different customer or market indicators?
Hadoop delivers near-linear performance with affordable, off-the-shelf hardware deployed in
clustered compute grids numbering thousands of nodes, with a highly fault-tolerant architecture
designed to work around the failure of isolated nodes.
Building on core technologies originated by Google, including the Google File System (GFS) and
MapReduce, Internet companies such as Yahoo, Facebook, Twitter, and LinkedIn formed the
Hadoop community within the Apache Software Foundation to develop the platform for wider-scale
adoption.
Use case                  Industry(s)                                                               Examples
Risk mitigation           Financial services
Operational efficiency    Manufacturing, logistics, utilities, municipal/regional infrastructure
Source: Ovum
This scenario evolved incrementally, beginning with expansion of analytics beyond traditional
structured data to text analytics. An Ovum 2011 global survey of large enterprise data warehouse
users (with data stores exceeding a terabyte) revealed that 55% were routinely using text
analytics, mostly from email. From there, organizations across different sectors found their data
sources proliferating. For instance:
Mobile telco carriers found their markets growing more complex. No longer restricted
to offering voice services, their networks were now being used by subscribers for
texting, web access, shooting and sharing videos, gaming, and making electronic
payments. To manage their networks, call detail records (CDRs) were no longer
adequate; they now had to track web logs, location-based sensor data, and social
network activity to better understand how, and with whom, their subscribers were
interacting, in order to optimize their networks and solidify or build their customer base.
Energy and natural resource companies have always worked with large volumes of
data. As they explore new sources in less accessible or environmentally sensitive
locations, they require even more data to make the right decision on where and how
to extract the resource. The stakes rise given that the new locations may require
drilling thousands of feet below the surface to access new oil or gas finds; data is the
hedge to ensure that their expensive bets prove out. For instance, one leading
offshore driller is charting seismic data in five dimensions to identify the best locations
to drill for oil.
In each case, the volume of data and varying structure of data would overwhelm the capability of
conventional SQL platforms and exhaust the patience and budgets of IT organizations responsible
for maintaining them. Furthermore, the SQL query language does not lend itself well to complex
text, sentiment, time series, or abstract algorithmic analyses. While SQL continues to deliver value
and carries a large skills base, a new generation of platforms was needed to supplement SQL,
picking up where conventional data warehousing systems left off.
Batch processing for moving data into downstream systems, such as a data
warehouse or mart;
SQL querying for exploring and understanding your data through familiar BI tools;
Applying data mining and machine learning libraries for gaining new insights from your
data and driving competitive advantage; and
Providing the ability to extend the platform for things vendors haven't built.
Enterprises get the best of both worlds. They gain the flexibility and scalability of a platform that
can accept heterogeneous data and support a late-binding approach to modeling data. They also
gain the familiarity of the SQL environment, making Hadoop accessible to the existing base of SQL
developers and BI tools. And because the platform accepts familiar SQL queries alongside new
forms of analytics, it will evolve with their business needs.
Scalable data processing
Hadoop can perform large data manipulation and transformation tasks at lower cost compared to
most commercial data warehousing platforms. It can run highly complex jobs involving very large
sets of data that would otherwise be cost-prohibitive on SQL data warehouses. Similarly, it can
take on highly compute-intensive tasks, freeing more
expensive SQL platforms to focus on the highest value workloads. As noted earlier, data
transformation is an excellent candidate, as Hadoop allows enterprises to tap the power of
MapReduce processing to perform these operations. With data transformation offloaded to
Hadoop, your SQL data warehouse can accept more queries, run more analytics problems, and
generate more reports.
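To make the offload pattern concrete, the sketch below expresses a simple transformation as a
Hadoop Streaming job in Python. The pipe-delimited call detail record layout, field positions, file
paths, and script name are illustrative assumptions, not details taken from this paper.

```python
#!/usr/bin/env python
"""Minimal Hadoop Streaming ETL sketch (hypothetical): aggregate call
minutes per subscriber from pipe-delimited call detail records.

Submit with the standard streaming jar, for example:
  hadoop jar hadoop-streaming.jar \
    -input /raw/cdrs -output /transformed/cdrs \
    -mapper "python cdr_etl.py map" -reducer "python cdr_etl.py reduce" \
    -file cdr_etl.py
"""
import sys

def map_phase():
    # Emit (subscriber_id, minutes) for each well-formed record;
    # malformed records are skipped rather than failing the whole job.
    for line in sys.stdin:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 3:
            continue
        try:
            print(f"{fields[0]}\t{float(fields[2]):.1f}")
        except ValueError:
            continue

def reduce_phase():
    # Hadoop delivers mapper output sorted by key, so a running
    # total per subscriber is all the reducer needs.
    current, total = None, 0.0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                print(f"{current}\t{total:.1f}")
            current, total = key, 0.0
        total += float(value)
    if current is not None:
        print(f"{current}\t{total:.1f}")

if __name__ == "__main__":
    map_phase() if sys.argv[1] == "map" else reduce_phase()
```

Because the raw records stay in HDFS untouched, a flawed transform can simply be corrected and
re-run against the originals, a point the Staging section below returns to.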
Extensibility
Hadoop's capability for storing data in volume and with high variety (variable structure) supports
extensibility in several ways.
First, because storage is inexpensive, raw data and transformed data can be kept side by side.
That allows data schemas to evolve over time as new sources of data become available,
operational parameters or key performance indicators evolve, or the competitive landscape
changes. This is a sharp contrast to relational platforms, where the database is modeled in
advance based on anticipated queries and taken offline if and when the data models change.
There are numerous examples underscoring the benefit of allowing data models to evolve with the
business. For instance, mobile carriers can enrich call detail records with other interactions such
as web logs as their business changes, or logistics providers can supplement auto ID data with
GPS readings to gain more granular snapshots of their transport operations.
Second, the ability to accommodate variably structured data gives organizations visibility into
data and data sources that were traditionally outside the reach of SQL data warehouses. For
instance, customer data can now be enriched with social media data without taking the system
down; similarly, operational data can be enriched with data from new sensors as they come
online. As data sources change, their impact on schema can be tested in sandboxed mode without
affecting the online data set, then brought into production on a staged basis, as the sketch below
illustrates.
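As a minimal illustration of the late-binding, schema-on-read approach described above, the sketch
below imposes structure on variably structured JSON event records only at read time; the field
names, including the newly arrived "geo" field, are hypothetical.

```python
import json

def read_events(path):
    """Schema-on-read sketch: the raw events were stored without a fixed
    schema, and structure is imposed only as the data is read. A field
    added later (here, "geo" from new sensors) simply appears for new
    records and defaults to None for legacy ones; no reload or downtime."""
    with open(path) as f:
        for line in f:
            raw = json.loads(line)
            yield {
                "customer_id": raw.get("customer_id"),   # original field
                "channel": raw.get("channel", "voice"),  # added later; default for old records
                "geo": raw.get("geo"),                   # newest field; None for legacy data
            }
```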
Staging
Although Hadoop was designed to accommodate variably structured data, that doesn't eliminate
the need for transformation processes. In the enterprise, Hadoop's initial role has been as a
data-refining platform; typically, data is transformed by coding MapReduce programs to perform the
task. With commercial ETL tools supporting native operation inside Hadoop, generation of these
MapReduce data transformation routines is becoming automated.
Hadoop as a staging platform has additional benefits. As a lower-cost environment, Hadoop is
well-suited for performing the transformation and loading operations that would otherwise consume
cycles on more expensive SQL data warehousing platforms. It is also more reliable: the
fault-tolerant architecture built into Hadoop enables transformation jobs to complete even if some
nodes go down. More importantly, if you must change or correct your ETL routine after the fact, you
have several options: you can temporarily mobilize additional nodes to perform the corrective work,
or you can re-run the entire transform from scratch, because Hadoop is scalable and inexpensive
enough to store the original raw data side by side with the transformed copy.
Flexible division of labor
By performing data transformations in Hadoop, your organization can get the best of both worlds:
access to the inexpensive processing power of Hadoop and the familiarity of its existing BI/data
warehousing environments. It can also gain flexibility regarding placement of data; it can remain in
Hadoop or be moved only when necessary. Hadoop allows your organization to take advantage of
the flexibility that results when data transformation is decoupled from data movement. Once data is
transformed inside Hadoop, you can rationalize the division of labor, matching analytic workloads
to the right target. Some of the options include:
Querying Hadoop interactively, thanks to emerging frameworks that provide high-performance
alternatives to MapReduce.
With Hadoop paired alongside your existing data warehouse/data mart infrastructure, your
organization can keep its options open regarding how, where, and when it runs analytics. Your
organization can leverage the strength of each platform, while enabling it to take advantage of new
processing frameworks and capabilities that are becoming available on Hadoop. With Hadoop,
your organization can avoid facing the trade-off between transform compute cycles and data
richness.
Active archiving
Hadoop enables organizations to take maximum advantage of the ongoing decline in storage costs
across all forms of media. Although Hadoop itself will not replace archiving media or processes, it
provides a cost-effective alternative for keeping online the older data from data warehouse and
OLTP systems that would otherwise be moved to archive. This can prove especially valuable for:
Credit providers, with long-term customer relationships, who gain better visibility over
spending patterns across multiple economic cycles;
Financial services firms, who gain the capability to uncover long-running patterns of
fraud or security breaches when examining data covering extended periods;
Climatological research bodies, who gain full visibility over long-term meteorological
data, yielding insights on climate trends and their impacts on different sectors of the
economy.
Data exploration
Hadoop's ability to store variably structured data allows it to ingest information and feeds from a
much broader array of sources. This allows exploration and analysis of data traditionally off-limits
to relational systems because of its size, complex structure, or variability, or all of the
above. For instance, Hadoop allows data types such as satellite imagery or acoustic recordings to
be queried. It provides a scalable platform for running powerful frameworks that can profile,
decipher structure from, and/or apply structure to unstructured or variably structured data.
Hadoop allows programmatic parsing and analysis of the data for discovering potentially significant
statistical patterns that can yield valuable insight on an organization's competitive landscape; its
customers; their usage of, and preferences for, its products; and other phenomena. Exploiting the
"Variety" V of Big Data, Hadoop can be used as a platform for data scientists and statisticians to
freely explore data to identify new analytic problems; uncover counter-intuitive insights; and
discover the implicit structure and importance of little-known data sets.
Data exploration in Hadoop can also make business analysts more productive with traditional data
modeling processes. Hadoop's flexible schema-on-read capability allows SQL engines such as
Impala to present raw data for quick viewing in a familiar self-service BI tool. There, analysts can
iteratively refine and cleanse the data to develop a star schema and identify what data should be
pushed out to data warehouses or data marts.
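As a sketch of how an analyst might begin such an exploration, the snippet below issues an
interactive SQL query to Impala using the open source impyla Python client; the host, the table
name, and the query itself are assumptions for illustration.

```python
# Minimal sketch: querying raw Hadoop data interactively through Impala
# with the open source impyla client (pip install impyla). The host and
# the raw_clickstream table are hypothetical; 21050 is Impala's default
# port for the HiveServer2 protocol that impyla speaks.
from impala.dbapi import connect

conn = connect(host="impala-daemon.example.com", port=21050)
cur = conn.cursor()

# The table is metadata laid over raw files already sitting in Hadoop;
# nothing is moved or pre-transformed before it becomes queryable.
cur.execute("""
    SELECT channel, COUNT(*) AS events
    FROM raw_clickstream
    GROUP BY channel
    ORDER BY events DESC
    LIMIT 10
""")
for channel, events in cur.fetchall():
    print(channel, events)
```

From quick looks like this, an analyst can decide which refinements are worth carrying forward into
a formal schema.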
Platform evolution
Hadoop is rapidly evolving to add the capabilities that enterprises expect from their data
management platforms. It is becoming a multi-faceted platform that can support a wide range of
processing patterns, and it is becoming more robust.
Processing approaches
From Hadoop's roots as a platform for running MapReduce-style batch processing, numerous
alternatives are emerging to increase its versatility, including:
Interactive SQL query, which makes Hadoop accessible to existing enterprise SQL-based
query and reporting tools;
Streaming data processing, which transforms Hadoop into a real-time platform with
capability to ingest and analyze high-velocity data;
Hardening
With a rich, multi-decade heritage, the SQL platform has accumulated an extensive array of tools,
technologies, and practices for managing access and protecting data. Enterprises that are
integrating Hadoop into their analytic data infrastructure expect no less of it.
Tools and technologies for securing data, regulating access to it, and overseeing data activity are
rapidly emerging in the Hadoop environment. A rich third-party vendor ecosystem is evolving to
deliver value-add to the core open source platform, enabling:
Monitoring and management of activity in the use and processing of Hadoop data;
While not as mature as the SQL environment, the tools, technologies, practices, and technology
provider ecosystem for Hadoop are evolving rapidly.
Commercial tool vendors are taking this support to new levels by extending their offerings beyond
merely interfacing with Hadoop to operating natively within it.
Cloudera's introduction of the Impala open source framework takes Hadoop-SQL convergence to
the next level. Impala, an Apache-licensed open source project developed by Cloudera, brings interactive
SQL query directly to Hadoop. It offers a high-performance, massively parallel processing
framework that works against any Hadoop file format. While Impala utilizes the Hive metadata
store, it provides a higher-performance alternative to relying on batch-oriented MapReduce and
Hive processes. Commercial support and management automation for Impala will soon be
released by Cloudera as an optional upgrade to Cloudera Enterprise under the Real-Time Query
(Cloudera Enterprise RTQ) subscription package.
The introduction of Impala adds new options for enterprises in optimizing their data warehousing
strategies. While Impala is not intended to replace your enterprise data warehouse, data mart, or
conventional BI reporting, it is well suited for exploratory analytics where new data sources can be
rapidly transformed and exposed to self-service BI. Impala also lends itself to helping business
analysts iterate modeling for data that may eventually be migrated to a data warehouse. New
frameworks such as Impala are broadening the Hadoop platform well beyond its batch processing
roots. They allow enterprises to broaden their options for analytics, using criteria such as cost,
scope, and delivery requirements to determine the right platform for the right query.
Manageability
Starting with deployment and system health
Cloudera has also been active in bringing to Hadoop the management capabilities that have long
been customary in the relational database world. It introduced Cloudera Manager, a tool for
automating deployment and monitoring the health of Hadoop operation. Cloudera developed the
management tool because open source Hadoop, developed as a series of components, was not a
centralized, cohesive piece of software that could be deployed with a single install.
Originally developed to simplify deployment and management at the system level, and enhance
Hadoop stability, Cloudera Manager automates configuration of Hadoop components; supports
development of deployment workflows; manages service starts and restarts; and monitors the
overall health of the system.
Recent enhancements focusing on uptime and data handling
Cloudera Navigator, a new feature of Cloudera Manager, tracks how data is utilized; specifically, it
compiles an audit trail detailing what operations were performed against specific pieces of data, by
whom, and when. In its initial release, Navigator will track activity against HDFS, Hive, and HBase.
This capability can be helpful, both for data exploration and security. By understanding how data is
consumed, organizations can understand its value and role in supporting specific business
activities. Understanding data utilization can help your organization pinpoint what policies
or security measures should be associated with specific categories of data. Another new feature in
Cloudera Manager adds support for rolling upgrades, enabling side-by-side deployment of patches
without the need to take the entire Hadoop instance offline. This capability can help IT
organizations significantly reduce the amount of scheduled downtime.
Cloudera has also recently introduced a new Backup and Disaster Recovery module (Cloudera
Enterprise BDR) to automate recovery processes. BDR makes Hadoop replication processes more
robust by ensuring that metadata is also replicated. In so doing, it ensures not only that data and
metadata remain in sync, but also that system recoveries can be executed consistently.
Eliminating contention between ad hoc and standard query and reporting activity;
Implementing Active Archiving that derives new value from aging data by providing a
cost-effective alternative for keeping such data online; and/or
Eliminating or reducing the need to continually augment data warehouses with costly
new technology infrastructure simply to meet SLAs.
At the IT organization level, leverage and extend, rather than force replacement of,
existing IT skillsets. Playing friendly with the large base of enterprise SQL expertise
has become a hot spot for vendor activity. With its existing partnerships and new
initiatives such as Impala, Cloudera has been actively pursuing this goal.
At the data center level, coexist with IT infrastructure, interoperating and integrating
with existing BI and data warehousing platforms and tools.
Clearly, the Hadoop platform and the capabilities for hardening it for the enterprise are works in
progress. SQL already has a 30-year head start when it comes to development of data platforms
along with tooling for integrating data, profiling it, cleansing it, safeguarding access to it, and
implementing a policy-driven lifecycle for it. However, development of core capabilities for making
Hadoop enterprise-ready is ramping up fast. The emergence of Hadoop is following a script that is
similar to what occurred with BI and data warehousing back in the mid-1990s, where tooling and
practices matured within a 2-3 year period. History is repeating itself with Hadoop. For instance,
today there are tools that track data activity; tomorrow there will be tools that selectively provide
role-based access.
Author
Tony Baer, Principal Analyst, Ovum IT Enterprise Solutions
tony.baer@ovum.com
Ovum Consulting
We hope that this analysis will help you make informed and imaginative business decisions. If you
have further requirements, Ovum's consulting team may be able to help you. For more information
about Ovum's consulting capabilities, please contact us directly at consulting@ovum.com.
Disclaimer
All Rights Reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any
form by any means, electronic, mechanical, photocopying, recording, or otherwise, without the
prior permission of the publisher, Ovum (an Informa business).
The facts of this report are believed to be correct at the time of publication but cannot be
guaranteed. Please note that the findings, conclusions, and recommendations that Ovum delivers
will be based on information gathered in good faith from both primary and secondary sources,
whose accuracy we are not always in a position to guarantee. As such, Ovum can accept no
liability whatsoever for actions taken based on any information that may subsequently prove to be
incorrect.