How to Get Hadoop Data Management on the Right Track
A White Paper
WebFOCUS
iWay Software
Omni
Table of Contents
Introduction
Conclusion
Introduction
A recent survey shows that 55 percent of big data projects are never finished, due to inaccurate
scope, technical roadblocks, and data silos.[1] Misconceptions, mistakes, and poor planning can
negatively impact deployments by wasting time and resources, hindering performance, and
delaying return on investment.
In this white paper, we'll highlight six issues you need to account for to get the most value from
your Hadoop ecosystem.
We'll discuss best practices that can help you avoid some of the most common mistakes made
during Hadoop rollouts, so you can put your big data initiative on the path to success from
the start.
Kaskade, Jim. "CIOs & Big Data: What Your IT Team Wants You to Know," infochimps, January 2013.
Information Builders
A data integration tool will reduce coding, and help accelerate the movement of
data into and out of Hadoop.
On top of expensive and hard-to-find skills for Hadoop installation and support, there are related
software and hardware needs to consider. These costs, while certainly lower than traditional big
data storage methods, add up quickly, often totaling far more than originally anticipated.
A powerful and broad-reaching data integration framework can help. Initially, a data integration
tool may require some additional investment. But down the road it will allow for much faster (and
far less expensive) execution of Hadoop initiatives by accelerating the movement of data into
Hadoop and drastically reducing the amount of coding required to manage it. Furthermore, data
integration talent is easier to find, and far more affordable, than the Hadoop, HBase, and other
developers who would otherwise be needed.
Kaskade, Jim. "CIOs & Big Data: What Your IT Team Wants You to Know," infochimps, January 2013.
Backaitis, Virginia. "Big Data Skills Shortage? Not on MapR's (Pre-IPO) Watch," CMSWire, January 2015.
If data ponds are a must, use data quality management, master data
management, and data governance technologies to mitigate the risks.
These individual ponds can become a data analysis nightmare. "Now your bold plan to answer
questions, secure data in a central place, and reduce costs is a bit of a quagmire of even more
disparate systems and organizational power plays," Oliver claims in a separate article. "Instead
of one data lake, in which all needs are filled, you have multiple data ponds in which only a few
needs can be filled."[5]
Your goal should be a single repository of big data regardless of role or department. If data
ponds are a must, derive them from your data lake. That will help you implement comprehensive
data quality management, master data management, and data governance policies, supported by
robust technologies, that will mitigate the associated risks.
Oliver, Andrew C. "The 10 Worst Big Data Practices," InfoWorld, July 2014.
Oliver, Andrew C. "Data Lakes, Data Castles, and Data Ponds," LinkedIn, September 2014.
There are plenty of integration tools; many don't run natively in Hadoop. Find one
that is tightly integrated with Hadoop.
While there are plenty of integration tools to choose from, many don't run natively in Hadoop.
Solutions that run outside the Hadoop environment result in thrashing and inefficient data
movement between the integration and data integrity tools and the Hadoop data stores. The
highest levels of performance in data movement and quality management are achieved when
integration and data integrity requirements are addressed from within the Hadoop ecosystem, so
look for a platform that is tightly integrated with Hadoop.
Hadoop is not a database; it is built to handle the big data your RDBMS can't.
Since Hadoop is not a database and is not architected to support an indexed query, you'll still
need your RDBMS for online transaction processing (OLTP) or any other high-performance,
time-sensitive data analysis requirements that are not driven by large, unstructured data sets (e.g.,
analyzing small or mid-sized data sets in real time). Hadoop, on the other hand, is a framework
specifically built to handle what your RDBMS can't: processing and analyzing large volumes of
structured and semi-structured data.
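The difference in access patterns can be sketched in a few lines of Python. This is an illustration only, not Hadoop code: the indexed point lookup uses an in-memory SQLite table as a stand-in for an RDBMS, and the full-scan aggregation is a simplified stand-in for batch-style processing. The `orders` table, its columns, and the function names are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for an RDBMS (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "acme", 120.0), (2, "globex", 75.5), (3, "acme", 30.0)],
)


def lookup_order(order_id):
    """OLTP-style access: fetch one row through the primary-key index."""
    return conn.execute(
        "SELECT customer, amount FROM orders WHERE order_id = ?", (order_id,)
    ).fetchone()


def total_by_customer(rows):
    """Batch-style access: read every row once and aggregate the whole set."""
    totals = {}
    for _order_id, customer, amount in rows:
        totals[customer] = totals.get(customer, 0.0) + amount
    return totals


single_order = lookup_order(2)
batch_totals = total_by_customer(conn.execute("SELECT * FROM orders"))
```

The point lookup touches a single row and returns immediately, which is the kind of small, time-sensitive request the RDBMS should keep. The aggregation reads the entire data set, which is the kind of workload that is worth moving to Hadoop once volumes grow.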
Furthermore, Hadoop users tend to be do-it-yourself types, so you may find value in solutions
that simplify Hadoop usage. Emerging technologies allow ordinary people with limited
programming skills to use Hadoop without the need to understand its underlying complexities.
This makes it easier to manage skills, contain costs, and mitigate risks when top developers leave,
or non-Hadoop developers make errors.
Hadoop cannot stand alone if you want to analyze structured and unstructured
information together. It must connect to other systems and sources.
"The CMOs and CIOs I talk with agree that we can't exclude traditional data derived from product
transaction information, financial records, and interaction channels, such as the call center and
point-of-sale," says Forbes contributor Lisa Arthur. "All of that is big data, too, even though it may
be dwarfed by the volume of digital data that's now growing at an exponential rate."[7]
The Hadoop environment cannot stand alone if you want to analyze structured and unstructured
information together. It must seamlessly connect to other systems and sources.
An integration platform that leverages all assets in the enterprise and provides broad information
reach is very important. Since information infrastructures are often diverse, an integration solution
must be able to share data bi-directionally between Hadoop and relational databases, packaged
applications, files, mainframe systems, and other sources. There's also machine-generated data
to consider, which will grow to a projected 40 percent by 2020, with an estimated 200 billion
connected devices in use.[8]
Evelson, Boris. "Make Your BI Environment More Agile With BI on Hadoop," Forrester, August 2015.
The potential for quality issues within Hadoop is high, yet many organizations
ignore the need to maintain big data integrity.
Inaccuracies, inconsistencies, and other problems make it extremely difficult for users to work
with big data. Yet, cleansing and standardizing massive data sets can be an overwhelming task,
especially if data quality is approached manually, without supporting technologies in place.
Profiling your data to understand what it contains and how it's used can help. Data that is heavily
used and widely shared may demand extensive quality management and governance. Less critical
information, however, may not require so much work. Once you've identified the data sets that
need the most attention, you can develop a plan to manage and ensure their integrity.
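As a rough illustration of what profiling involves, the following Python sketch computes two common measures for each field in a small sample: completeness (the share of non-empty values) and the number of distinct values. The `profile` function, the field names, and the sample records are hypothetical; a production profiling tool would run inside the Hadoop environment and cover far more checks.

```python
from collections import Counter


def profile(records, fields):
    """Return completeness, distinct count, and most common value per field."""
    total = len(records)
    report = {}
    for field in fields:
        values = [record.get(field) for record in records]
        # Treat None and empty strings as missing values.
        present = [value for value in values if value not in (None, "")]
        report[field] = {
            "completeness": len(present) / total if total else 0.0,
            "distinct": len(set(present)),
            "most_common": Counter(present).most_common(1),
        }
    return report


# Hypothetical sample with a duplicate ID, inconsistent casing, and gaps.
sample = [
    {"customer_id": "C1", "state": "NY"},
    {"customer_id": "C2", "state": "ny"},
    {"customer_id": "C3", "state": ""},
    {"customer_id": "C3", "state": None},
]

report = profile(sample, ["customer_id", "state"])
```

In this sample, the `state` field is only half populated and holds two spellings of the same value, while `customer_id` is fully populated but has fewer distinct values than records, which hints at a duplicate. Findings like these help decide which data sets need the most attention.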
Redman, Thomas C. "Data's Credibility Problem," Harvard Business Review, December 2013.
Conclusion
So what can you do to avoid common mistakes and keep your Hadoop initiative on the right
track?
Know what you're getting into, and proactively streamline your Hadoop project. If you're
aware of and prepared for the cost and complexity associated with Hadoop, you can proactively
make sure your project stays on schedule and within budget. Leveraging data integration tools is a
great way to accelerate the data movement portion of your project, and keep expenses down.
Work towards a single view of your data, even if you need data ponds. Data ponds can
cause many problems in Hadoop environments. But if you decide that smaller data repositories
are necessary, you can reduce the risks by applying data quality and master data management
technologies to ensure the consistency and accuracy of information across those sources.
Use a separate integration tool. Advanced integration capabilities are a core part of your big
data strategy, and Hadoop won't provide what you need. Bring in a third-party integration solution
that can run natively in the Hadoop environment to complement your Hadoop implementation
and promote best practices in data movement and unification. The integration platform you
choose should reduce coding and maintenance. A tool that allows you to configure rather than
code will simplify implementation and administration, and save time and money. Furthermore,
an integration solution should be vendor-agnostic. If you need to switch big data vendors for any
reason, you'll want to minimize the impact on Hadoop.
Use your RDBMS and Hadoop. Do not view your Hadoop environment as a replacement for
your RDBMS. Hadoop is designed to handle the processing and analysis of large volumes of
structured and unstructured data. It is not suited for the types of transaction processing that an
RDBMS handles well: high-performance, real-time queries using small or mid-sized data sets.
Hadoop should reflect all your data. Big data is not just about your unstructured information.
The combination of structured and unstructured data, achieved through powerful
enterprise integration with access to the broadest array of sources, will give your stakeholders the
most complete view of your business, driving the insights needed to boost performance.
Maintain ongoing data integrity. Your ability to exploit big data for competitive advantage
depends on its consistency and accuracy. Data quality and master data management technologies
will ensure the information in your Hadoop environment is fit for purpose at all times.
As data volumes grow dramatically, so does the need for effective information management. iWay
data integrity and integration solutions from Information Builders help you derive the greatest
possible value from your big data analytic repositories. Our technologies work seamlessly in
Hadoop environments, ensuring the quality, consistency, completeness, and availability of even
the largest volumes of data.
Corporate Headquarters Two Penn Plaza, New York, NY 10121-2898 (212) 736-4433 Fax (212) 967-6406
Connect With Us
informationbuilders.com askinfo@informationbuilders.com
DN7508337.0416
Copyright 2016 by Information Builders. All rights reserved. All products and product names mentioned in this publication are
trademarks or registered trademarks of their respective companies.