
6 Issues That Can Derail Your Big Data Initiative
How to Get Hadoop Data Management on the Right Track

A White Paper


Table of Contents

Introduction

Hadoop Is More Complex and Expensive Than You Think

Don't Drown in the Data Ponds

Hadoop Is Not an Integration Tool

Hadoop Is Not a Relational Database Management System

Structured Data Is Just as Important as Unstructured Information

Raw Data Must Be Refined Before It's Moved Into Hadoop

Conclusion

Introduction
A recent survey shows that 55 percent of big data projects are never finished, due to inaccurate
scope, technical roadblocks, and data silos.1 Misconceptions, mistakes, and poor planning can
negatively impact deployments by wasting time and resources, hindering performance, and
delaying return on investment.
In this white paper, we'll highlight six issues you need to account for to get the most value from your Hadoop ecosystem:

Hadoop is more complex and expensive than you think

Don't drown in the data ponds

Hadoop is not an integration tool

Hadoop is not a relational database management system (RDBMS)

Structured data is just as important as unstructured information

Raw data needs to be refined before you move it into Hadoop

We'll discuss best practices that can help you avoid some of the most common mistakes made
during Hadoop rollouts, so you can put your big data initiative on the path to success from
the start.

1 Kaskade, Jim. "CIOs & Big Data: What Your IT Team Wants You to Know," Infochimps, January 2013.

Information Builders

Hadoop Is More Complex and Expensive Than You Think


It's a common fallacy that, no matter what your big data management issue, Hadoop can fix it quickly, with minimal investment. But getting up and running with Hadoop is not as simple as it seems. There are talent shortages, hardware requirements, and hidden costs you probably haven't anticipated.
Programming in Hadoop is complex, calling for experience and a deep understanding of how to develop Mappers, Reducers, Partitioners, Combiners, and more. All Hadoop applications are made up of numerous tasks, each comprising countless steps. Finding experienced and knowledgeable Hadoop talent to tackle these activities proves challenging for up to 80 percent of CIOs.2
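To see why even trivial jobs demand this coordination, consider a word count, the canonical MapReduce example. The toy sketch below simulates the map, shuffle/sort, and reduce phases in plain Python; it runs locally with no cluster or Hadoop API, and the sample input is illustrative only.

```python
# Toy simulation of the MapReduce phases a Hadoop developer must wire together.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # The Mapper's job: emit (word, 1) for every word in an input line.
    for word in line.lower().split():
        yield (word, 1)

def reducer(word, counts):
    # The Reducer's job: combine all counts that arrive for one key.
    return (word, sum(counts))

def run_job(lines):
    # The shuffle/sort step Hadoop performs between map and reduce:
    # group all intermediate pairs by key before handing them to reducers.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return dict(
        reducer(word, (c for _, c in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    )

print(run_job(["big data big plans", "big budgets"]))
```

Even here, a developer must reason about three distinct phases and the data flow between them; production jobs add Partitioners, Combiners, serialization, and cluster tuning on top.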
"One of the biggest obstacles to the adoption of Hadoop in the enterprise is the shortage of professionals trained to work with it," claims a recent CMSWire article. "According to job posting aggregator Indeed.com, there's been as much as a 225,000 percent growth in demand for the big data-crushing skill since 2009, and no one is schooling engineers at that rate."3

A data integration tool will reduce coding and help accelerate the movement of data into and out of Hadoop.

On top of expensive and hard-to-find skills for Hadoop installation and support, there are related software and hardware needs to consider. These costs, while certainly lower than traditional big data storage methods, add up quickly, often totaling far more than originally anticipated.
A powerful and broad-reaching data integration framework can help. Initially, a data integration tool may require some additional investment. But down the road, it will allow for much faster (and far less expensive) execution of Hadoop initiatives by accelerating the movement of data into Hadoop and drastically reducing the amount of coding required to manage it. Furthermore, data integration talent is easier to find, and far more affordable, than the Hadoop, HBase, and other developers who would otherwise be needed.

2 Kaskade, Jim. "CIOs & Big Data: What Your IT Team Wants You to Know," Infochimps, January 2013.

3 Backaitis, Virginia. "Big Data Skills Shortage? Not on MapR's (Pre-IPO) Watch," CMSWire, January 2015.


Don't Drown in the Data Ponds


Dealing with multiple versions of the truth is a long-standing problem created by fragmented and siloed information systems. It's an issue that many organizations hope to solve through their Hadoop deployments.
But some companies allow individual departments to create their own mini-repositories to support data analysis. By setting up a series of smaller data ponds, rather than a single, enterprise-wide data lake, you're simply creating new silos.
"That doesn't sound bad at first, but with different extracts and ways of slicing and dicing the data, you end up with different views of the data," states big data consultant Andrew C. Oliver. "I don't mean flat versus cube; I mean different answers for some of the same questions."4

If data ponds are a must, use data quality management, master data
management, and data governance technologies to mitigate the risks.

These individual ponds can become a data analysis nightmare. "Now your bold plan to answer questions, secure data in a central place, and reduce costs is a bit of a quagmire of even more disparate systems and organizational power plays," Oliver claims in a separate article. "Instead of one data lake, in which all needs are filled, you have multiple data ponds in which only a few needs can be filled."5
Your goal should be a single repository of big data, regardless of role or department. If data ponds are a must, derive them from your data lake. That will help you implement comprehensive data quality management, master data management, and data governance policies, supported by robust technologies, that will mitigate the associated risks.

4 Oliver, Andrew C. "The 10 Worst Big Data Practices," InfoWorld, July 2014.

5 Oliver, Andrew C. "Data Lakes, Data Castles, and Data Ponds," LinkedIn, September 2014.


Hadoop Is Not an Integration Tool


All Hadoop initiatives include requirements for integrating enterprise data and ensuring its integrity. Raw data needs to be collected and transformed into information that is ready for wide-scale consumption.
While Hadoop offers some integration-type capabilities, it is far from being an end-to-end solution for unifying data and managing its quality. You'll need a way to get enterprise data into Hadoop, as well as to connect your Hadoop repositories with other enterprise sources so data can be pulled back out. You'll also need to ensure the accuracy, completeness, and consistency of your big data.
This requires more robust and full-featured data integration, complete with features for transformation, metadata management, and broad connectivity to any information asset, as well as data quality and master data management. Without the right supporting integration infrastructure in place, Hadoop becomes little more than another lone silo of information.

There are plenty of integration tools; many don't run natively in Hadoop. Find one that is tightly integrated with Hadoop.

While there are plenty of integration tools to choose from, many don't run natively in Hadoop. Solutions that run outside the Hadoop environment result in thrashing and inefficient data movement between the integration and data integrity tools and the Hadoop data stores. The highest levels of performance in data movement and quality management are achieved when integration and data integrity requirements are addressed from within the Hadoop ecosystem, so look for a platform that is tightly integrated with Hadoop.
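The cost of running integration outside the cluster can be illustrated with a toy sketch (the record set is hypothetical, and real Hadoop economics involve network and disk rather than in-memory lists): a tool that filters where the data lives moves only the matching records, while an external tool must first ship everything out.

```python
# Hypothetical cluster-resident data: 10,000 records, of which 1% are relevant.
cluster = [{"id": i, "flag": i % 100 == 0} for i in range(10_000)]

# External integration tool: move all records out of the cluster, then filter.
shipped_out = list(cluster)                       # 10,000 records transferred
result_external = [r for r in shipped_out if r["flag"]]

# Integration running natively: filter inside the cluster, ship only matches.
result_native = [r for r in cluster if r["flag"]]  # 100 records transferred

print(len(shipped_out), len(result_native))
```

The answers are identical, but the external approach transfers a hundred times the data in this sketch; at big data scale, that difference is the thrashing described above.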


Hadoop Is Not a Relational Database Management System


Many big data initiatives call for unstructured data to be combined with more traditional
structured information, such as that contained in a relational database management system
(RDBMS). And while Hadoop offers flexibility, scalability, and cost-efficiency benefits over an
RDBMS when processing massive amounts of data, it should not serve as a replacement for one.
Yet many organizations mistakenly assume that Hadoop will perform like a traditional RDBMS. You'll need to know when your RDBMS has to do the heavy lifting, and when to rely on Hadoop.

Hadoop is not a database; it is built to handle the big data your RDBMS can't.

Since Hadoop is not a database and is not architected to support an indexed query, you'll still need your RDBMS for online transaction processing (OLTP) or any other high-performance, time-sensitive data analysis requirements that are not driven by large, unstructured data sets (e.g., analyzing small or mid-sized data sets in real time). Hadoop, on the other hand, is a framework specifically built to handle what your RDBMS can't: processing and analyzing large volumes of structured and semi-structured data.
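The workload split can be made concrete with a toy example (SQLite stands in for the RDBMS here; nothing in this sketch is Hadoop itself, and the table is hypothetical): an indexed point query touches one row directly, while a Hadoop-style batch job must read every record to compute its answer.

```python
# SQLite as a stand-in RDBMS: contrast an indexed OLTP lookup with a full scan.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, i * 1.5) for i in range(1, 10001)])

# OLTP-style point query: the primary-key index locates one row immediately.
row = conn.execute("SELECT total FROM orders WHERE id = ?", (42,)).fetchone()

# Batch-style aggregate: every record is read, the pattern Hadoop is built for
# at scales where no single database could hold the data.
scan_total = sum(t for (t,) in conn.execute("SELECT total FROM orders"))

print(row[0], scan_total)
```

The point lookup stays fast no matter how large the table grows, which is why OLTP belongs on the RDBMS; the full scan is the shape of work that Hadoop parallelizes across a cluster.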
Furthermore, Hadoop users tend to be do-it-yourself types, so you may find value in solutions that simplify Hadoop usage. Emerging technologies allow ordinary people with limited programming skills to use Hadoop without the need to understand its underlying complexities. This makes it easier to manage skills, contain costs, and mitigate risks when top developers leave or non-Hadoop developers make errors.


Structured Data Is Just as Important as Unstructured Information


Many organizations deploy Hadoop to get a handle on their unstructured information: documents, audio and video files, blog posts, social media text, etc. But unstructured content is just a subset of enterprise data. Focusing all efforts solely on collecting, storing, and analyzing unstructured information, at the expense of intelligence contained within structured data, will lead to incomplete and ineffective insights.
Driving value from big data is about gleaning insight from all the information contained in all
available assets. But while organizations are only using 31 percent of their unstructured data and
28 percent of their semi-structured data, structured data also remains an untapped resource, with
only about 40 percent of it being leveraged for strategic decision-making.6

Hadoop cannot stand alone if you want to analyze structured and unstructured
information together. It must connect to other systems and sources.

"The CMOs and CIOs I talk with agree that we can't exclude traditional data derived from product transaction information, financial records, and interaction channels, such as the call center and point-of-sale," says Forbes contributor Lisa Arthur. "All of that is big data, too, even though it may be dwarfed by the volume of digital data that's now growing at an exponential rate."7
The Hadoop environment cannot stand alone if you want to analyze structured and unstructured
information together. It must seamlessly connect to other systems and sources.
An integration platform that leverages all assets in the enterprise and provides broad information reach is very important. Since information infrastructures are often diverse, an integration solution must be able to share data bi-directionally between Hadoop and relational databases, packaged applications, files, mainframe systems, and other sources. There's also machine-generated data to consider, projected to grow to 40 percent of all data by 2020, with an estimated 200 billion connected devices in use.8

6 Evelson, Boris. "Make Your BI Environment More Agile With BI on Hadoop," Forrester, August 2015.

7 Arthur, Lisa. "What Is Big Data?" Forbes, August 2013.

8 "EMC Digital Universe Study," EMC and IDC, December 2012.


Raw Data Must Be Refined Before It's Moved Into Hadoop


Data quality is important in any scenario, but even more so when big data is involved. Given the
high volume of data being processed, the vast number of sources it comes from, and the varied
formats in which it exists, the potential for quality issues within Hadoop repositories is high. Yet
many organizations ignore the need to maintain big data integrity.
When big data is dirty, stakeholders are less likely to trust it to support planning and decision-making.
"When data are unreliable, managers quickly lose faith in them and fall back on their intuition to make decisions, steer their companies, and implement strategy," states data consultant Thomas C. Redman in a recent article. "They are, for example, much more apt to reject important, counterintuitive implications that emerge from big data analyses."9

The potential for quality issues within Hadoop is high, yet many organizations
ignore the need to maintain big data integrity.

Inaccuracies, inconsistencies, and other problems make it extremely difficult for users to work
with big data. Yet, cleansing and standardizing massive data sets can be an overwhelming task,
especially if data quality is approached manually, without supporting technologies in place.
Understanding your data and how it's used, through profiling, can help. Data that is heavily used and widely shared may demand extensive quality management and governance. Less critical information, however, may not require so much work. Once you've identified the data sets that need the most attention, you can develop a plan to manage and ensure their integrity.
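A minimal profiling sketch, assuming simple dictionary-shaped records (the field names and sample data below are hypothetical), might measure completeness and cardinality per field to flag where quality-management effort should go:

```python
# Profile each field's completeness (null rate) and cardinality (distinct values).
def profile(records):
    fields = {k for r in records for k in r}
    report = {}
    for f in sorted(fields):
        values = [r.get(f) for r in records]
        present = [v for v in values if v not in (None, "")]
        report[f] = {
            "null_rate": 1 - len(present) / len(values),
            "distinct": len(set(present)),
        }
    return report

customers = [
    {"id": 1, "email": "a@example.com", "region": "east"},
    {"id": 2, "email": "",              "region": "east"},
    {"id": 3, "email": "c@example.com", "region": None},
]
print(profile(customers))
```

A high null rate on a heavily shared field is a signal that the data set needs cleansing and governance before it lands in Hadoop; a sparsely used field with the same defect may not be worth the effort.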

9 Redman, Thomas C. "Data's Credibility Problem," Harvard Business Review, December 2013.


Conclusion
So what can you do to avoid common mistakes and keep your Hadoop initiative on the right
track?
Know what you're getting into, and proactively streamline your Hadoop project. If you're aware of and prepared for the cost and complexity associated with Hadoop, you can proactively make sure your project stays on schedule and within budget. Leveraging data integration tools is a great way to accelerate the data movement portion of your project and keep expenses down.
Work towards a single view of your data, even if you need data ponds. Data ponds can
cause many problems in Hadoop environments. But if you decide that smaller data repositories
are necessary, you can reduce the risks by applying data quality and master data management
technologies to ensure the consistency and accuracy of information across those sources.
Use a separate integration tool. Advanced integration capabilities are a core part of your big data strategy, and Hadoop won't provide what you need. Bring in a third-party integration solution that can run natively in the Hadoop environment to complement your Hadoop implementation and promote best practices in data movement and unification. The integration platform you choose should reduce coding and maintenance. A tool that allows you to configure rather than code will simplify implementation and administration, and save time and money. Furthermore, an integration solution should be vendor-agnostic. If you need to switch big data vendors for any reason, you'll want to minimize the impact on Hadoop.
Use your RDBMS and Hadoop. Do not view your Hadoop environment as a replacement for your RDBMS. Hadoop is designed to handle the processing and analysis of large volumes of structured and unstructured data. It is not suited for the types of transaction processing that an RDBMS handles well: high-performance, real-time queries using small or mid-sized data sets.
Hadoop should reflect all your data. Big data is not just about your unstructured information. The combination of both structured and unstructured data, achieved through powerful enterprise integration with access to the broadest array of sources, will give your stakeholders the most complete view of your business, driving the insights needed to boost performance.
Maintain ongoing data integrity. Your ability to exploit big data for competitive advantage
depends on its consistency and accuracy. Data quality and master data management technologies
will ensure the information in your Hadoop environment is fit for purpose at all times.
As data volumes grow dramatically, so does the need for effective information management. iWay
data integrity and integration solutions from Information Builders help you derive the greatest
possible value from your big data analytic repositories. Our technologies work seamlessly in
Hadoop environments, ensuring the quality, consistency, completeness, and availability of even
the largest volumes of data.



Corporate Headquarters Two Penn Plaza, New York, NY 10121-2898 (212) 736-4433 Fax (212) 967-6406
Connect With Us
informationbuilders.com askinfo@informationbuilders.com

DN7508337.0416

Copyright © 2016 by Information Builders. All rights reserved. All products and product names mentioned in this publication are trademarks or registered trademarks of their respective companies.
