
Hadoop: Extending Your Data Warehouse
An Ovum white paper for Cloudera
SUMMARY
Catalyst
Surging data volumes are stressing traditional enterprise data warehouse (DW) and business
intelligence (BI) architectures. As data volumes and sources proliferate, IT organizations are
seeking new flexibility for running their data warehousing environments to reduce cost; increase
availability for query, analytics, and reporting; and provide flexibility for tapping new or additional
data sources for intelligence. While existing relational data warehouse architectures have served
enterprises well, organizations require more flexibility to meet their changing needs, while
leveraging the tools, skillsets, and practices that they have already established.

Ovum view
Hadoop is steadily maturing into an enterprise data platform that is adding the SQL compatibility,
interactivity, manageability, and security that IT organizations expect. A versatile, economical
platform for scalable data processing, Hadoop can help organizations better meet their SLAs while
accommodating a wide range of analytic, exploration, query, and transformation workloads.
Hadoop is evolving from its batch processing origins into a flexible, economical hub where
organizations store original data, keep archival data active, and grow their options for data
exploration, modeling, analysis, and reporting. Hadoop will complement, not replace, your data
warehouse and BI infrastructure, providing new flexibility for generating insights as your business
requirements change.

Key messages

Hadoop offers an economical platform to offload data transformation cycles, and can
potentially simplify and reduce errors in the process

Hadoop adds extensibility to enterprise analytics

Hadoop is rapidly adding features that provide the SQL compatibility that businesses
need and the hardening that IT organizations expect

Although Hadoop will not replace your enterprise data warehouse, now is the time to
start planning for Hadoop to assume a greater proportion of your enterprise DW
workload

THE BI BOTTLENECK
Figure 1. Traditional multi-tier BI/Data warehousing architecture

Source: Ovum

The rationale for multi-tiered DW architecture


Because analytic workloads differ from transactional ones, BI and data warehousing systems have
evolved as a separate data tier for most enterprises, as shown in Figure 1. The typical BI/data
warehousing architecture was based on structured data, with online transaction processing (OLTP)
relational databases being the most common information sources. In most cases, a separate
staging server was deployed to perform specialized extract, transformation, and load (ETL)
operations that sourced data from transaction systems, cleansed and converted it, and loaded it to
data warehouse or data mart targets.
This architecture was optimized for the volumes of data in the megabyte/gigabyte range that
predominated in most enterprise OLTP database installations of the 1990s and early 2000s.
Utilizing a staging server minimized the impact on source and target systems by abstracting
compute-intensive data transformation cycles. The operations were typically scheduled either as
off-hour batch jobs or as semi-continuous trickle feeds. The drawback of this architecture, however,
was the complexity that came with the addition of a middle data staging tier. Furthermore, as data
volumes hit the terabyte range, the time and overhead associated with moving such large blocks of
data became a significant drain in their own right.

Figure 2. The ELT pattern

Source: Ovum

The emergence of the Extract, Load, Transform (ELT) pattern


Surging data volumes drove the need to flatten the BI architecture by shifting data transformation
loads onto the target system, a pattern known as Extract/Load/Transform (ELT), as shown in Figure 2.
The obvious advantage is that data movement is reduced: only a single movement from source to
target is needed, and the transformation workload is co-located with where the data is stored and
analyzed. The emergence of ELT reflected the reality that batch windows were shrinking; as
enterprises globalized and faced the need for 24x7 operation, less and less time remained
available for batch operations.
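
To make the pattern concrete, the following is a minimal, hypothetical sketch of ELT in Python, using the standard library's sqlite3 module as a stand-in for the target warehouse; the file name, table names, and columns are illustrative assumptions rather than part of the pattern itself.

    import csv
    import sqlite3

    # Stand-in for the target warehouse; in a real deployment this would be the
    # data warehouse connection, not SQLite.
    conn = sqlite3.connect("warehouse.db")
    cur = conn.cursor()

    # Extract + Load: land the raw extract in the target untouched (no staging server).
    cur.execute("CREATE TABLE IF NOT EXISTS stg_orders (order_id TEXT, amount TEXT, country TEXT)")
    with open("orders_extract.csv", newline="") as f:
        rows = [(r["order_id"], r["amount"], r["country"]) for r in csv.DictReader(f)]
    cur.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", rows)

    # Transform: cleansing and conversion now run on the target platform itself,
    # competing for the same resources used by query and reporting workloads.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS fct_orders AS
        SELECT order_id,
               CAST(amount AS REAL) AS amount_usd,
               UPPER(TRIM(country)) AS country_code
        FROM stg_orders
        WHERE amount <> ''
    """)
    conn.commit()
    conn.close()

The essence of the pattern shows in the final step: the transformation SQL executes on the same engine that must also serve queries and reports, which is precisely the resource contention discussed next.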
However, because the platform running ELT is also the same platform that performs analytics and
reporting, some tradeoffs become inevitable: the same set of resources must be carved up to
accommodate each of these workloads. As data volumes grow to the terabyte range and data sources
and types grow more diverse, the SLAs for each workload are jeopardized. Transforming
terabytes of data requires more processing resources; in turn, transforming diverse data sources
such as text, rich media, or machine data is more complex than transforming traditional structured
data, requiring more compute cycles. Under this scenario, satisfying SLAs triggers a renewed cycle of
investment in higher-cost infrastructure. For instance, Fibre Channel may be chosen over standard
1 Gigabit Ethernet (GbE) to accelerate internal movement of data to and from disk, to speed one
part of the process.

Enterprise DWs have their limits


Although some conventional enterprise data warehouses have exceeded a petabyte in size, most
were not economically designed to handle volumes of data beyond the gigabyte or terabyte level.
With typical licensing and storage costs of commercial enterprise data warehouses ranging from
$20,000 to $50,000 per terabyte, most organizations face difficult choices as they balance the
opportunities of deriving new insights from wider data samples against one (or more) of the
following scenarios:

Regulatory mandates compel them to retain the data online, forcing them to budget
for continually spiraling licensing and storage costs.

Regulatory mandates dictate that data be retained, but not necessarily live. The
typical choice is to move the data to archive, where the data is accessible only for
restoration processes and therefore cannot be used for analytics. For budgeting
purposes, the result is sunk costs with negligible economic return.

There are no regulatory requirements; in most cases, the economics of existing
commercial SQL data warehousing platforms does not yet make it cost-effective to
keep the data live.

Yet, what if your organization could make that archive active, and do so at a small fraction of the
cost of a data warehouse? By analyzing customer behavior over 7-10 year periods, as opposed to
the 1-3 years that are typically available, your organization could derive valuable insights such as:

What is the core propensity to buy that is independent of economic cycle?

How do the buying habits of different demographic groups, segmented by age,
gender, region, or income, vary by stage of economic or product innovation cycle?

What promotional or cross-selling strategies work best for different demographic
groups during upturns or downturns in the economy?

What unexpected insights might arise after mining 10 years of data and associating it
with different customer or market indicators?

HADOOP: THE PROMISE FOR ENTERPRISE DATA WAREHOUSING
High-volume data processing origins
Hadoop emerged as a data processing framework designed to solve unique, Internet-scale
operational problems such as generating large search indices, customizing landing pages, and
optimizing online ad placement. The magnitude of the data demanded a new approach. For instance,
Facebook found itself exhausting the capabilities of its SQL data warehouse as cycle times for daily
refreshes were stretching past 24 hours. Likewise, maintaining database schemas grew
cumbersome owing to the variety of formats of incoming data, especially when text data and log
files were involved. The scale-out architectures featuring industry-standard hardware that were
developed for Internet data centers made Hadoop possible. They proved the feasibility of achieving virtually
linear performance with off-the-shelf, affordable hardware deployed in clustered compute grids
numbering thousands of nodes, with highly fault-tolerant architectures designed to work
around the failure of individual nodes.
Based on the core technologies originated by Google, including the Google File System (GFS) and
MapReduce, Internet companies such as Yahoo, Facebook, Twitter, and LinkedIn formed
the Hadoop community within the Apache Software Foundation to develop the platform for
wider-scale adoption.

Table 1. Hadoop: typical early enterprise use cases

Use case: Customer holistic view
Industry(s): Mass marketers, digital advertisers
Examples: Predictive churn analysis, upsell/cross-sell, cross-channel identity resolution
Data sources, feeds: CRM, call center records, email, web logs, social media data, location data

Use case: Risk mitigation
Industry(s): Financial services
Examples: Fraud detection, counterparty risk management, credit scoring
Data sources, feeds: Capital market feeds, macroeconomic data, web logs, transaction data, social media data

Use case: Operational efficiency
Industry(s): Manufacturing, logistics, utilities, municipal/regional infrastructure
Examples: Supply chain optimization, smart grids, smart urban infrastructure
Data sources, feeds: Machine data from distributed smart sensors connected via wired & wireless broadband

Source: Ovum

Core enterprise use cases


Doug Cutting, along with others, developed Hadoop while working at Yahoo; he subsequently
joined Cloudera as chief architect in 2009 and currently chairs the Apache Software Foundation.
Hadoop delivered significant performance benefits for early adopters; for instance, at Yahoo,
transformation workloads that formerly required eight hours shrank to 15 minutes. It also enabled
Yahoo to significantly improve click-through rates from member landing pages by allowing
links to be optimized through analysis of all member navigation, rather than a sample.
While Hadoop was first implemented by Internet companies to handle operational problems that
were unique to their industry, enterprises in sectors such as financial services,
telecommunications, retail, and media subsequently discovered the power of the platform for
analytics. With Hadoop, enterprises could vastly expand the scale and scope of data analyzed and
do so at an economical cost; in many cases, they could extend analytics to all of the data, rather
than samples. Thanks to the flexibility of Hadoop, your organization can evolve its analytics as its
needs change, without having to structure data and define queries up front, when the database is
being designed.
Although Hadoop was a new analytics platform for the enterprise, Ovum has found that early use
cases have been highly familiar, largely concentrated in the core areas of customer maintenance,
risk mitigation, and operational efficiency, as shown in Table 1.

This scenario evolved incrementally, beginning with expansion of analytics beyond traditional
structured data to text analytics. An Ovum 2011 global survey of large enterprise data warehouse
users (with data stores exceeding a terabyte) revealed that 55% were routinely using text
analytics, mostly from email. From there, organizations across different sectors discovered that
data sources were proliferating. For instance:

Mass marketers discovered a growing proportion of customer interactions were
occurring outside what was covered by their existing systems, including CRM
(Customer Relationship Management) applications, call center management, or email
tracking. Increasingly, customers were communicating with them (or about them) on
social networks, and interacting through smart mobile devices.

Mobile telco carriers found their markets growing more complex. No longer restricted
to offering voice services, their networks were now being used by subscribers for
texting, web access, shooting and sharing videos, gaming, and making electronic
payments. To manage their networks, call detail records (CDRs) were no longer
adequate; they now had to track web logs, location-based sensory data, and social
network activity to better understand how, and with whom, their subscribers were
interacting, in order to optimize their networks and solidify or build their customer base.

Energy and natural resource companies have always worked with large volumes of
data. As they explore new sources in less accessible or environmentally sensitive
locations, they require even more data to make the right decision on where and how
to extract the resource. The stakes rise given that the new locations may require
drilling thousands of feet below the surface to access new oil or gas finds; data is the
hedge to ensure that their expensive bets prove out. For instance, one leading
offshore driller is charting seismic data in five dimensions to identify the best locations
to drill for oil.

In each case, the volume of data and varying structure of data would overwhelm the capability of
conventional SQL platforms and exhaust the patience and budgets of IT organizations responsible
for maintaining them. Furthermore, the SQL query language does not lend itself well to complex
text, sentiment, time series, or abstract algorithmic analyses. While SQL continues to deliver value
and carries a large skills base, there was a need for a new generation of platforms to supplement
SQL, picking up where conventional data warehousing systems left off.


Figure 3. Hadoop as Data Transformation platform

Source: Ovum

Hadoop's secret: Versatility


As an analytic/data warehousing platform, Hadoop provides the scale, flexibility, and economical
storage that picks up where traditional relational data warehouse platforms leave off. While
Hadoop does not eliminate the need to model or structure data, it provides the freedom for
organizations to allow the model to evolve as they gain better understanding of their data, or their
competitive environment or analytic problems change. It provides a scalable, economic platform
where data transformation cycles are performed, offloading the burden from SQL systems as
shown in Figure 3.
The obvious benefits of Hadoop include scalability, with near-linear performance documented up to
thousands of nodes, along with the cost advantages of leveraging commodity processors, disks,
and network interconnects. Yet the flexibility of the Hadoop platform could yield its most profound
benefits, as it can be utilized for multiple purposes such as:

Batch processing for moving data into downstream systems, such as a data
warehouse or mart;

SQL querying for exploring and understanding your data through familiar BI tools;

Applying data mining and machine learning libraries for gaining new insights from your
data and driving competitive advantage; and

Providing the ability to extend the platform for things vendors haven't built.


Enterprises get the best of both worlds. They gain the flexibility and scalability of a platform that
can accept heterogeneous data and support a late-binding approach to modeling data. They also
gain the familiarity of the SQL environment, making Hadoop accessible to the existing base of SQL
developers and BI tools. And with a platform that accepts familiar SQL queries and new forms of
analytics, organizations gain a platform that will evolve with their business needs.
Scalable data processing
Hadoop can perform large data manipulation and transformation tasks at lower cost compared to
most commercial data warehousing platforms. It allows the running of highly complex problems
involving very large sets of data that would otherwise be cost-prohibitive to run on SQL data
warehouses. Similarly, it allows offloading of highly compute-intensive tasks, allowing more
expensive SQL platforms to focus on the highest value workloads. As noted earlier, data
transformation is an excellent candidate, as Hadoop allows enterprises to tap the power of
MapReduce processing to perform these operations. With data transformation offloaded to
Hadoop, your SQL data warehouse can accept more queries, run more analytics problems, and
generate more reports.
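
As a hedged illustration of the kind of transformation work that can be pushed down (this example is not drawn from the Ovum material), the sketch below uses Hadoop Streaming, which allows MapReduce jobs to be written as ordinary scripts that read from stdin and write to stdout; the record layout, field positions, and file names are assumptions.

    #!/usr/bin/env python
    # mapper.py -- hypothetical cleansing step for a Hadoop Streaming job.
    # Reads raw, pipe-delimited transaction records on stdin and emits
    # tab-separated (customer_id, amount) pairs, dropping malformed rows.
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 3:
            continue          # tolerate malformed input instead of failing the job
        customer_id, amount = fields[0].strip(), fields[2].strip()
        try:
            print("%s\t%.2f" % (customer_id, float(amount)))
        except ValueError:
            continue

    #!/usr/bin/env python
    # reducer.py -- sums the cleansed amounts per customer.
    # Hadoop Streaming delivers keys to the reducer grouped and sorted.
    import sys

    current_key, total = None, 0.0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current_key:
            if current_key is not None:
                print("%s\t%.2f" % (current_key, total))
            current_key, total = key, 0.0
        total += float(value)
    if current_key is not None:
        print("%s\t%.2f" % (current_key, total))

A job of this shape would typically be submitted via the hadoop-streaming jar, with -mapper, -reducer, -input, and -output arguments pointing at the scripts and the HDFS paths; only the finished, transformed output then needs to be loaded into the warehouse.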
Extensibility
Hadoop's capability for storing data in volume and with high variety (variable structure) supports
extensibility in several ways.
First, because storage is inexpensive, Hadoop can retain large volumes of data economically,
allowing raw data and transformed data to be kept side by side. That allows the data schema
to evolve over time as new sources of data become available, operational parameters or key
performance indicators evolve, or the competitive landscape changes. This is a sharp contrast to
relational platforms, where the database is modeled in advance based on anticipated queries, and
taken offline if and when the data model is changed.
There are numerous examples underscoring the benefit of allowing data models to evolve with the
business. For instance, mobile carriers can enrich call detail records with other interactions such
as web logs as their business changes, or logistics providers can supplement auto ID data with
GPS readings to gain more granular snapshots of their transport operations.
Second, the ability to accommodate variably structured data allows organizations to gain visibility
into data and data sources that were traditionally outside the reach of SQL data warehouses. For
instance, customer data can now be enriched with social media data without taking the system
down; similarly, operational data can be enriched with data from new sensors that are coming
online. As data sources change, their impact on the schema can be tested in sandboxed mode without
affecting the online data set, then brought into production on a staged basis.
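
What schema-on-read means in practice can be sketched in a few lines of Python; the feed format, field names, and file path below are illustrative assumptions rather than anything prescribed by Hadoop. The raw records are stored exactly as they arrive, and structure is applied only when they are read, so records that predate a newly added source simply lack the newer fields.

    import json

    def read_interactions(path):
        """Apply structure at read time ("schema on read") to a raw JSON-lines feed."""
        with open(path) as f:
            for line in f:
                rec = json.loads(line)
                yield {
                    "customer_id": rec.get("customer_id"),
                    "channel":     rec.get("channel", "unknown"),
                    # "sentiment" only appears after the social media feed was added;
                    # older records need no reload or schema migration.
                    "sentiment":   rec.get("sentiment"),
                }

    if __name__ == "__main__":
        for row in read_interactions("interactions.jsonl"):  # illustrative path
            print(row)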
Staging
Although Hadoop was designed to accommodate variably structured data, that doesn't eliminate
the need for transformation processes. In the enterprise, Hadoop's initial role has been as a data
refining platform; typically, data is transformed by coding MapReduce programs to perform the
task. With commercial ETL tools supporting native operation inside Hadoop, generation of these
MapReduce data transformation routines is becoming automated.
Hadoop as a staging platform has additional benefits. As a lower-cost environment, Hadoop is well
suited to taking on the resource-intensive transformation and loading cycles that would otherwise
run on more expensive SQL data warehousing platforms. It is also more reliable: Hadoop's
fault-tolerant architecture enables transformation jobs to complete even if some nodes go down.
More importantly, if you must change or correct your ETL routine after the fact, you have several
options: you can temporarily mobilize additional nodes to perform the corrective work, or you can
re-run the entire transform from scratch, because Hadoop is scalable and inexpensive enough to
store the original raw data side by side with the transformed copy.
Flexible division of labor
By performing data transformations in Hadoop, your organization can get the best of both worlds:
access to the inexpensive processing power of Hadoop and the familiarity of its existing BI/data
warehousing environments. It also gains flexibility regarding the placement of data, which can remain in
Hadoop or be moved only when necessary. Hadoop allows your organization to take advantage of
the flexibility that results when data transformation is decoupled from data movement. Once data is
transformed inside Hadoop, you can rationalize the division of labor, matching analytic workloads
to the right target. Some of the options include:

Running time-consuming, resource-intensive analytic workloads inside Hadoop while
reserving routine query, analytics, and reporting for the data warehouse or data mart;

Querying Hadoop directly, thanks to the capabilities of most commercial BI tools to
read Hive metadata; or

Querying Hadoop interactively, thanks to emerging frameworks that provide high-performance alternatives to MapReduce.

With Hadoop paired alongside your existing data warehouse/data mart infrastructure, your
organization can keep its options open regarding how, where, and when it runs analytics. Your
organization can leverage the strength of each platform, while enabling it to take advantage of new
processing frameworks and capabilities that are becoming available on Hadoop. With Hadoop,
your organization can avoid facing the trade-off between transform compute cycles and data
richness.
Active archiving
Hadoop enables organizations to take maximum advantage of the ongoing decline in storage costs
across all forms of media. Although Hadoop itself will not replace archiving media or processes, it
provides a cost-effective alternative for keeping online the older data from data warehouse and OLTP
systems that would otherwise be moved to archive (a minimal sketch of the mechanics follows the
examples below). This can prove especially valuable for:

Credit providers, with long-term customer relationships, who gain better visibility over
spending patterns across multiple economic cycles;

Financial services firms, who gain the capability to uncover long-running patterns of
fraud or security breaches when examining data covering extended periods;

Healthcare organizations, including life sciences companies, payers, and care
providers, who analyze long-term patient outcomes; and

Climatological research bodies, who gain full visibility over long-term meteorological
data, yielding insights on climate trends and their impacts on different sectors of the
economy.
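
As a minimal sketch of the mechanics, assuming the hdfs command-line client is available and that aged monthly extracts have already been exported from the warehouse (the paths and naming scheme below are illustrative only):

    import subprocess

    # Land aged warehouse extracts in a partitioned HDFS layout so that they
    # remain online and queryable instead of going to offline archive media.
    ARCHIVE_ROOT = "/archive/orders"   # illustrative HDFS path

    def archive_month(local_extract, year, month):
        target = "%s/year=%d/month=%02d/" % (ARCHIVE_ROOT, year, month)
        subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", target])
        subprocess.check_call(["hdfs", "dfs", "-put", "-f", local_extract, target])

    if __name__ == "__main__":
        # e.g. a monthly extract exported from the warehouse before it is purged there
        archive_month("orders_2006_03.csv", 2006, 3)

Because the layout is partitioned by date, the archived files remain online and can later be exposed to SQL-on-Hadoop engines as an external table, rather than sitting inert on archive media.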

Data exploration
Hadoop's ability to store variably structured data allows it to ingest information and feeds from a
much broader array of sources. This allows exploration and analysis of data traditionally off-limits
to relational systems because of its size, complex structure, or variability, or all of the
above. For instance, Hadoop allows data types such as satellite imagery or acoustic recordings to
be queried. It provides a scalable platform for running powerful frameworks that can profile,
decipher structure from, and/or apply structure to unstructured or variably structured data.
Hadoop allows programmatic parsing and analysis of the data for discovering potentially significant
statistical patterns that can yield valuable insight on an organization's competitive landscape; its
customers; their usage of and preferences for its products; and other phenomena. Exploiting the
"Variety" dimension of Big Data, Hadoop can be used as a platform for data scientists and statisticians to
freely explore data to identify new analytic problems; uncover counter-intuitive insights; and
discover the implicit structure and importance of little-known data sets.
Data exploration in Hadoop can also make business analysts more productive with traditional data
modeling processes. Hadoop's flexible schema-on-read capability allows SQL engines such as
Impala to present raw data for quick viewing in a familiar self-service BI tool. There, analysts can
iteratively refine and cleanse the data to develop a star schema and identify what data should be
pushed out to data warehouses or data marts.
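
As a hedged sketch of such a session, the example below assumes the impyla Python client and a reachable Impala daemon; the host, table definition, delimiter, and HDFS location are illustrative assumptions.

    from impala.dbapi import connect  # impyla client, assumed installed

    # Connect to an Impala daemon (host/port are illustrative).
    conn = connect(host="impala-daemon.example.com", port=21050)
    cur = conn.cursor()

    # Expose a raw, tab-delimited web log directory through the metastore.
    # The files are not reloaded or transformed; structure is applied at read time,
    # so the analyst can iterate on this definition as understanding improves.
    cur.execute("""
        CREATE EXTERNAL TABLE IF NOT EXISTS raw_weblogs (
            ts      STRING,
            user_id STRING,
            url     STRING
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
        LOCATION '/raw/weblogs'
    """)

    # A first exploratory query, equally answerable from a self-service BI tool.
    cur.execute("""
        SELECT url, COUNT(*) AS hits
        FROM raw_weblogs
        GROUP BY url
        ORDER BY hits DESC
        LIMIT 10
    """)
    for url, hits in cur.fetchall():
        print(url, hits)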

Platform evolution
Hadoop is rapidly evolving to add the capabilities that enterprises expect from their data
management platforms. It is becoming a multi-faceted platform that can support a wide range of
processing patterns, and it is becoming more robust.
Processing approaches
From Hadoop's roots as a platform for running MapReduce-style batch processing, numerous alternatives
are emerging to increase its versatility, including:

Interactive SQL query, which makes Hadoop accessible to existing enterprise SQL-based query and reporting tools;

Streaming data processing, which transforms Hadoop into a real-time platform with
capability to ingest and analyze high-velocity data;

Graph processing, which allows the analysis of many-to-many relationships (e.g.,
charting social groups and spheres of influence, managing telecommunications
networks, or optimizing evidence-based healthcare delivery);

Scientific computation, which enables processing of highly complex, multivariate
problems; and

Tiering and in-memory processing, for optimizing access to hot data.

Hardening
With a rich, multi-decade heritage, the SQL platform has accumulated an extensive array of tools,
technologies, and practices for managing access and protecting data. Enterprises that are
integrating Hadoop into their analytic data infrastructure expect no less of it.
Tools and technologies for securing, regulating data access, and overseeing data activity are
rapidly emerging in the Hadoop environment. A rich third party vendor ecosystem is evolving to
deliver value-add to the core open source platform, enabling:

Selective or full encryption of data stored in Hadoop;

Role-based access to the Hadoop platform;

Tracking of data lineage;

Monitoring and management of activity in the use and processing of Hadoop data;

Data recovery from unplanned outages and interruptions of activity; and

More robust data replication capability.

While not as mature as the SQL environment, the tools, technologies, practices, and technology
provider ecosystem for Hadoop are evolving rapidly.

HOW CLOUDERA SUPPORTS HADOOP ENTERPRISE READINESS
Formed in 2008 by principals from Oracle, Yahoo, Facebook, Google, and other open source
projects, Cloudera pioneered commercial support for Hadoop, and has built a growing roster of
enterprise clients. It has been very active in making the platform more accessible to the enterprise
SQL developer base, and making it more manageable and secure.

Cloudera Impala: Making Hadoop SQL-friendly


Ovum believes that the hot spot for Hadoop development in 2013 is convergence with SQL. This is
essential because SQL remains an immutable fact of life in enterprise IT.
Cloudera has been an active player in making Hadoop SQL-friendly. It has long partnered with
leading ETL, BI, and Data warehousing platform and tool providers to offer connectivity between
Hadoop and SQL platforms. In turn, many of these technology providers are taking connectivity to
new levels by extending their offerings beyond interfacing with Hadoop to operating
natively within it.
Cloudera's introduction of the Impala open source framework takes Hadoop-SQL convergence to
the next level. Impala, an Apache open source project developed by Cloudera, brings interactive
SQL query directly to Hadoop. It offers a high-performance, massively parallel processing
framework that works against any Hadoop file format. While Impala utilizes the Hive metadata
store, it provides a higher-performance alternative to relying on batch-oriented MapReduce and
Hive processes. Commercial support and management automation for Impala will soon be
released by Cloudera as an optional upgrade to Cloudera Enterprise under the Real-Time Query
(Cloudera Enterprise RTQ) subscription package.
The introduction of Impala adds new options for enterprises in optimizing their data warehousing
strategies. While Impala is not intended to replace your enterprise data warehouse, data mart, or
conventional BI reporting, it is well suited for exploratory analytics where new data sources can be
rapidly transformed and exposed to self-service BI. Impala also lends itself to helping business
analysts iterate modeling for data that may eventually be migrated to a data warehouse. New
frameworks such as Impala are extending the Hadoop platform well beyond its batch processing
roots, allowing enterprises to broaden their options for analytics, using criteria such as cost,
scope, and delivery requirements to determine the right platform for the right query.

Manageability
Starting with deployment and system health
Cloudera has also been active in bringing to Hadoop the management capabilities that have long
been customary in the relational database world. It introduced Cloudera Manager, a tool for
automating deployment and monitoring the health of Hadoop operation. Cloudera developed the
management tool because open source Hadoop, developed as a series of components, was not a
centralized, cohesive piece of software that could be deployed with a single install.
Originally developed to simplify deployment and management at the system level, and enhance
Hadoop stability, Cloudera Manager automates configuration of Hadoop components; supports
development of deployment workflows; manages service starts and restarts; and monitors the
overall health of the system.
Recent enhancements focusing on uptime and data handling
Cloudera Navigator, a new feature of Cloudera Manager, tracks how data is utilized; specifically, it
compiles an audit trail detailing what operations were performed against specific pieces of data, by
whom, and when. In its initial release, Navigator will track activity against HDFS, Hive, and HBase.
This capability can be helpful, both for data exploration and security. By understanding how data is
consumed, organizations can understand its value and role in supporting specific business
activities. Understanding data utilization can help your organization pinpoint what policies
or security measures should be associated with specific categories of data. Another new feature in
Cloudera Manager adds support for rolling upgrades, enabling side-by-side deployment of patches
without the need to take the entire Hadoop instance offline. This capability can help IT
organizations significantly reduce the amount of scheduled downtime.
Cloudera has also recently introduced a new Backup and Disaster Recovery module (Cloudera
Enterprise BDR) to automate recovery processes. BDR makes Hadoop replication processes more
robust by ensuring that metadata is also replicated. In so doing, it not only ensures that data and
metadata remain in sync, but also that system recoveries can be executed consistently.

RECOMMENDATIONS FOR ENTERPRISES


Hadoop can change the economics of managing data
As an open source platform designed to run on industry standard hardware, Hadoop could
significantly lower the costs of data management. It will not replace your data warehousing and BI
infrastructure, but augment it. Use Hadoop for:

Eliminating contention between operational support (such as data profiling and
transformation) and analytics/reporting;

Eliminating contention between ad hoc and standard query and reporting activity;

Incorporating new sources of data to broaden your analytics insights;

Introducing exploratory analytics without jeopardizing SLAs for routine
analytic/reporting activity;

Implementing Active Archiving that derives new value from aging data by providing a
cost-effective alternative for keeping such data online; and/or

Eliminating or reducing the need to continually augment data warehouses with new
costly technology infrastructure simply to meet SLAs.

Use Hadoop to evolve your analytics capabilities


Hadoop provides the best of both worlds: support of new programmatic analytics styles, and the
familiarity and responsiveness of interactive SQL access. If your organization only has SQL
expertise, it does not have to learn new frameworks such as MapReduce overnight; it can use
Hadoop to run exploratory analytics using familiar SQL BI tools. Take
advantage of capabilities to model data on demand to ask new questions or evaluate them for
future use in your data warehouse. From there, your organization can begin learning new
approaches, such as MapReduce or other processing frameworks that are emerging for the
Hadoop platform, to yield valuable new insights on your business.

Hadoop is rapidly becoming enterprise robust


Ovum believes that for Hadoop to penetrate the enterprise, it must become a first-class citizen with
IT, the data center, and the business. That means Hadoop must:

At the IT organization level, leverage and extend, rather than force replacement of,
existing IT skillsets. Playing friendly with the large base of enterprise SQL expertise
has become a hot spot for vendor activity. With its existing partnerships and new
initiatives such as Impala, Cloudera has been actively pursuing this goal.

At the data center level, coexist with IT infrastructure, interoperating and integrating
with existing BI and data warehousing platforms and tools.

At the enterprise level, support running of existing (SQL-oriented) analytic workloads,
while clearing the path for organizations to gradually learn their way to benefitting from
new forms of analytic processing.

Clearly, the Hadoop platform and the capabilities for hardening it for the enterprise are works in
progress. SQL already has a 30-year head start when it comes to development of data platforms
along with tooling for integrating data, profiling it, cleansing it, safeguarding access to it, and
implementing a policy-driven lifecycle for it. However, development of core capabilities for making
Hadoop enterprise-ready is ramping up fast. The emergence of Hadoop is following a script that is
similar to what occurred with BI and data warehousing back in the mid-1990s, where tooling and
practices matured within a 2-3 year period. History is repeating itself with Hadoop. For instance,
today there are tools that track data activity; tomorrow there will be tools that selectively provide
role-based access.

Start planning now


Look to the emergence of BI and data warehousing for a good example of how to become familiar
with Hadoop. Most organizations started small with departmental data marts to address a point of
pain, and built from there.
For Hadoop, select a pressing issue where the results will have visible impact; there is no single
use case for implementing Hadoop. The driver could be the urgency of ingesting a new form
of data, such as social media, to extend your 360-degree view of the customer. Alternatively, it could be
the need to better utilize the data your organization already has, such as active archiving that
derives new value from aging data.
Because Hadoop is a new platform, your IT organization will have to learn some new skills. But the
transition can be evolutionary, with improving and widening the scope of your existing analytics as
the logical first step.

Author
Tony Baer, Principal Analyst, Ovum IT Enterprise Solutions
tony.baer@ovum.com

Ovum Consulting
We hope that this analysis will help you make informed and imaginative business decisions. If you
have further requirements, Ovum's consulting team may be able to help you. For more information
about Ovum's consulting capabilities, please contact us directly at consulting@ovum.com.


Disclaimer
All Rights Reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any
form by any means, electronic, mechanical, photocopying, recording, or otherwise, without the
prior permission of the publisher, Ovum (an Informa business).
The facts of this report are believed to be correct at the time of publication but cannot be
guaranteed. Please note that the findings, conclusions, and recommendations that Ovum delivers
will be based on information gathered in good faith from both primary and secondary sources,
whose accuracy we are not always in a position to guarantee. As such Ovum can accept no
liability whatever for actions taken based on any information that may subsequently prove to be
incorrect.

