
Intro: Maximizing the fourth V of Big Data
Pitfall #1: Hadoop is not a data integration tool
Pitfall #2: MapReduce programmers are hard to find
Pitfall #3: Most data integration tools don't run natively within Hadoop
Pitfall #4: Hadoop may cost more than you think
Pitfall #5: Elephants don't thrive in isolation
Benchmark
Conclusion

Traditional business intelligence architectures are struggling to efficiently process Big Data sets, particularly massive semi-structured and unstructured data. As a result, it has been difficult to realize the full potential of Big Data. Hadoop allows organizations to overcome the architectural limitations of managing Big Data, but care needs to be taken to make the most of what Hadoop has to offer.

Big Data is commonly characterized in terms of the three Vs (high-volume, high-velocity, and high-variety data assets), but what really matters is the fourth V: value. Value is the positive impact on the business in terms of gaining actionable insight from massive amounts of data. Big Data can uncover significant value for organizations, for example: new revenue streams, new customer insights, improved decision making, better quality products, improved customer experience, and so on.

Hadoop has emerged as the de facto Big Data analytics operating system, helping organizations deal with the avalanche of data coming from logs, email, sensor devices, mobile devices, social media and more. While business intelligence systems are typically the last stop in extracting value from Big Data, the first stop is commonly manipulation of the data in a process called Extract, Transform, Load (ETL). ETL is the process by which data is moved from source systems, manipulated into a consumable format and loaded into a target system for advanced analytics, analysis and reporting. In fact, industry analyst Gartner recognizes that most organizations will adapt their data integration strategy to use Hadoop as a form of preprocessor for Big Data integration in the data warehouse.

However, as organizations begin to deploy this new framework, there are some pitfalls to avoid in successfully performing ETL with Hadoop. Businesses first need to know the pitfalls, and then how to overcome the challenges. We will offer some guiding principles to address these challenges, as well as specific details on how to leverage Syncsort's data integration tool for Hadoop, DMX-h, to drive sustainable success with your Hadoop deployment.

A data integration tool provides an environment that makes it easier for a broad audience to develop and maintain ETL jobs. Typical capabilities of a data integration tool include: an intuitive graphical interface; pre-built data transformation functions (aggregations, joins, change data capture [CDC], cleansing, filtering, reformatting, lookups, data type conversions, and so on); metadata management to enable re-use and data lineage; powerful connectivity to source and target systems; and advanced features to make data integration easily accessible to data analysts.

Although the primary use case of Hadoop is ETL, Hadoop is not a data integration tool itself. Rather, Hadoop is a reliable, scale-out parallel processing framework, meaning servers (nodes) can be easily added as workloads increase. It frees the programmer from concerns about how to physically manage large data sets when spreading processing across multiple nodes.

There is a rich ecosystem of Hadoop utilities that can be used to create ETL jobs, but they are all separately evolving projects and require specific, new skills. For example, Sqoop development (to move data into and out of HDFS from RDBMSs) requires skilled programmers knowledgeable in the Sqoop command-line syntax. Flume is used for moving data from a variety of systems into Hadoop; Oozie helps with workflows; and Pig is a scripting platform for more easily creating Hadoop jobs. However, they all require considerable hand-coding, as well as specialized skills and knowledge of Hadoop and MapReduce.

Finally, basic ETL operations such as data transformations are easy within a mature data integration tool. Trying to accomplish the same tasks with Hadoop alone can quickly become complex and demand significant expertise and effort. For example, building even a simple CDC process can easily translate into hundreds of lines of code, as the sketch below suggests, code that not only takes several days to develop, but also requires resources to maintain and tune as needs evolve. A preferred approach is to use a data integration tool that makes it easy to create and maintain Hadoop ETL jobs.
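To make the point concrete, the following is a minimal, hypothetical sketch of a hand-coded CDC job in Java MapReduce. It assumes pipe-delimited snapshot files whose first field is the record key, and it tells the old and new snapshots apart simply by looking for "new" in the input path; those conventions, the class names, and the paths are illustrative only, not anyone's production design. Even this toy version needs a custom mapper, a custom reducer and a driver, and it still lacks composite keys, type handling, error handling, and tuning, which is where the hundreds of lines come from.

```java
// Hypothetical hand-coded CDC job: compares an "old" and a "new" snapshot keyed
// on the first pipe-delimited field and classifies each key as INSERT, DELETE, or UPDATE.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SimpleCdc {

  public static class TagMapper extends Mapper<Object, Text, Text, Text> {
    private String tag;  // "OLD" or "NEW", derived from the input path (illustrative convention)

    @Override
    protected void setup(Context context) {
      String path = ((FileSplit) context.getInputSplit()).getPath().toString();
      tag = path.contains("new") ? "NEW" : "OLD";
    }

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\\|", 2);   // key|rest-of-record
      context.write(new Text(fields[0]), new Text(tag + "|" + value));
    }
  }

  public static class DiffReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      String oldRec = null, newRec = null;
      for (Text v : values) {
        String s = v.toString();
        if (s.startsWith("OLD|")) oldRec = s.substring(4); else newRec = s.substring(4);
      }
      if (oldRec == null)              context.write(new Text("INSERT"), new Text(newRec));
      else if (newRec == null)         context.write(new Text("DELETE"), new Text(oldRec));
      else if (!oldRec.equals(newRec)) context.write(new Text("UPDATE"), new Text(newRec));
      // unchanged records are simply dropped
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "simple-cdc");
    job.setJarByClass(SimpleCdc.class);
    job.setMapperClass(TagMapper.class);
    job.setReducerClass(DiffReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // old snapshot directory
    FileInputFormat.addInputPath(job, new Path(args[1]));   // new snapshot directory
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```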

ETL is emerging as the key use case for Hadoop implementations. However, Hadoop alone lacks many attributes needed for successful ETL deployments. Therefore, it's important to choose a data integration tool that can fill the ETL gaps:

- Choose a user-friendly graphical interface to easily build ETL jobs without writing MapReduce code.
- Ensure that the solution has a large library of pre-built data integration functions that can be easily reused.
- Include a metadata repository to enable re-use of developments, as well as data lineage tracking.
- Select a tool with a wide variety of connectors to source and target systems.

Syncsort DMX-h is high-performance data integration software that provides a smarter approach to Hadoop ETL including: an intuitive graphical interface for easily creating and maintaining jobs, a wide range of productivity features, metadata facilities for development re-use and data lineage, high-performance connectivity capabilities, and an ability to run natively, avoiding code generation.

Programming with the MapReduce processing paradigm in Hadoop requires not only Java programming skills, but also a deep understanding of how to develop the appropriate Mappers, Reducers, Partitioners, Combiners, and so on. A typical Hadoop task often has multiple steps (as shown in the figure below), and a typical application can have multiple tasks. Most of these steps need to be coded by a Java developer (or scripted in Pig); the driver sketch after the figure shows how many pieces a developer must wire together by hand for even a single task. With hand-coding, these steps can quickly become unwieldy to create and maintain.

Even when expert MapReduce programmers build jobs successfully, MapReduce code has limited metadata associated with it. This makes impact analysis and data lineage difficult to perform and creates an overall lack of transparency into the ETL execution flow. Ultimately, thousands of lines of Java code with no metadata and limited documentation pose major risks for organizations, specifically hindering business agility, complicating data governance, and jeopardizing regulatory compliance.

Not only does MapReduce programming require specialized skills that are hard to find and expensive, hand-coding also does not scale well in terms of job creation productivity, job re-use, and job maintenance. That's where data integration tools excel, with intuitive graphical interfaces, pre-built functions, and facilities to easily create, reuse, and maintain ETL jobs. With data integration tools, business analysts can graphically create, maintain, and re-use jobs in minutes or hours that would otherwise take days or weeks with a developer writing thousands of lines of code. Easy job creation and maintenance are critical in preventing bottlenecks that reduce an organization's ability to extract the full value of Big Data.

[Figure: The MapReduce processing flow, repeated across cluster nodes. Each task chains an Input Formatter, Map, Sort, an optional Partitioner and optional Combiner with spills to local disk, then a merge Sort, Reduce, and an Output Formatter writing to HDFS.]
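
To illustrate the hand-coding involved, here is a minimal driver sketch for a single MapReduce task. To keep it self-contained and compilable it wires in Hadoop's own library classes (TokenCounterMapper, IntSumReducer, HashPartitioner); in a real ETL flow each of those would typically be a hand-written class that the developer must also build, test, and maintain. This is generic Hadoop wiring, not DMX-h code.

```java
// Driver for one MapReduce task: even before any business logic is written,
// the developer must declare the input formatter, mapper, combiner, partitioner,
// reducer, output formatter, and the key/value types for each stage.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class SingleTaskDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "one-of-many-etl-steps");
    job.setJarByClass(SingleTaskDriver.class);

    job.setInputFormatClass(TextInputFormat.class);   // input formatter
    job.setMapperClass(TokenCounterMapper.class);     // map logic
    job.setCombinerClass(IntSumReducer.class);        // optional combiner
    job.setPartitionerClass(HashPartitioner.class);   // optional partitioner (shown explicitly)
    job.setReducerClass(IntSumReducer.class);         // reduce logic
    job.setOutputFormatClass(TextOutputFormat.class); // output formatter

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```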

Hadoop ETL requires organizations to acquire a completely new set of advanced programming skills that are expensive and difficult to find. To overcome this pitfall it's critical to choose a data integration tool that both complements Hadoop and leverages skills organizations already have:

- Select a tool with a graphical user interface (GUI) that abstracts the complexities of MapReduce programming.
- Look for pre-built templates specifically designed to create MapReduce jobs without manually writing code.
- Insist on the ability to re-use previously created MapReduce flows as a means to increase developer productivity.
- Avoid code generation, since it frequently requires tuning and maintenance.
- Visually track data flows with metadata and lineage.

Using DMX-h reduces or eliminates the need for costly, hard-to-find MapReduce programmers. With DMX-h, Mappers and Reducers are all built through an easy-to-use graphical development environment, eliminating the need to write any code. DMX-h provides powerful and highly efficient out-of-the-box capabilities for all key ETL functions and transformations. DMX-h Mapper and Reducer steps can optionally perform processing that eliminates the need for other steps in the MapReduce processing flow (including the InputFormatter, Partitioner, Combiner, and OutputFormatter) simply by checking options in the DMX-h graphical user interface.

There are a number of other benefits inherent in DMX-h as a powerful data integration tool that make MapReduce programming more efficient. First, it's easy to develop ETL jobs that execute within MapReduce by using pre-defined templates and accelerators for common transformations such as CDC, joins, and more. Second, jobs can be easily re-used to create new data flows in less time, improving developer productivity. Additionally, built-in metadata capabilities enable greater transparency into impact analysis, data lineage, and execution flow, thereby facilitating data governance and regulatory compliance. No code generation means there is no generated code to maintain or tune. As a result, organizations can minimize or even eliminate the need to find and acquire new MapReduce skills. Instead, they can leverage the ETL expertise of their existing staff to quickly learn and implement ETL processes in Hadoop using DMX-h.

Most data integration solutions offered for Hadoop do not run natively and generate hundreds of lines of code to accomplish even simple tasks. This can have a significant impact on the overall time it takes to load and process data. That's why it's critical to choose a data integration tool that is tightly integrated with Hadoop and can run natively within the MapReduce framework.

Moreover, it's important to consider not only the horizontal scalability inherent to Hadoop, but also the vertical scalability within each node, that is, the processing efficiency of each node. A good example of vertical scalability is sorting, a key component of every MapReduce process (equally important is connectivity efficiency, covered in Pitfall #5). The more efficient the vertical scalability, the faster the job processing time and the shorter the overall time to value. Unfortunately, many data integration tools add a layer of overhead that hurts performance.

Most data integration tools are peripheral to Hadoop. They simply interact with Hadoop from the outside, treating it as just another target engine to which they push processing, taking the same approach they take with relational databases: so-called pushdown optimization. This means they generate code, in most cases Java, Pig or HiveQL, which then needs to be compiled before it is executed in Hadoop. Generating optimal code is not trivial, and most of these tools end up generating very inefficient code that developers then need to understand, fine-tune, and maintain. It is better to run natively within Hadoop with no need to pre-compile, which is both easier to maintain and more efficient, eliminating processing overhead.

Most data integration tools are simply code generators that add extra overhead to the Hadoop framework. A smarter approach must fully integrate with Hadoop and provide the means to seamlessly optimize performance without adding complexity:

- Understand how different solutions are specifically interacting with Hadoop and the amount of code that they are generating.
- Choose solutions with the ability to run natively within each Hadoop node without generating code.
- Run performance benchmarks and study which tools deliver the best combination of price and performance for your most common use cases.
- Select an approach with built-in optimizations to maximize Hadoop's vertical scalability.

DMX-h provides a truly integrated approach to Hadoop ETL. DMX-h is not a code generator. Instead, Hadoop automatically invokes the highly efficient DMX-h runtime engine, which executes on all nodes as an integral part of the Hadoop framework. DMX-h automatically optimizes resource utilization (e.g., CPU, memory and I/O) on each node to deliver the highest levels of performance, scalability, and throughput, with no manual tuning needed. Compared with Java or Pig, DMX-h execution is typically 2 to 3x faster, which means it can process more data in the same amount of time without the need for additional nodes.

DMX-h has a very small footprint, with no dependencies on third-party systems such as a relational database, compiler, or application server for design or runtime. As a result, DMX-h can be easily installed and deployed on every data node in a Hadoop cluster or on virtualized environments in the cloud.

Syncsort achieves these performance differentiators by leveraging a number of contributions the company has made to the Apache Hadoop open source community, including a new feature that allows an external sort implementation to be plugged into the MapReduce framework (MAPREDUCE-2454). As a result, organizations using Hadoop no longer have to rely on the standard Hadoop sort, but can plug in their own sort implementation as well.

The pluggable sort capability also enables development of MapReduce jobs within the DMX-h graphical interface, and it allows the DMX-h engine to run natively within the Hadoop cluster nodes. This approach makes it much easier to implement common tasks that are otherwise difficult to express in Hadoop (e.g., joins). For all Hadoop users, the feature enables more sophisticated manipulation of data within Hadoop, such as hash aggregations, hash joins, sampling N matches, or even a no-sort option (i.e., the ability to bypass the sort when it is not needed or is redundant).
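For readers who want to see where this plugs in, the sketch below shows the configuration hooks exposed by the pluggable-sort work in the Hadoop 2.x MapReduce framework. The property names and the default class values shown are taken from the Hadoop 2.x line and may differ in your distribution (check mapred-default.xml); a native engine such as DMX-h would substitute its own implementations for these defaults, and those vendor class names are deliberately not shown here.

```java
// Sketch of the pluggable-sort configuration hooks in Hadoop 2.x MapReduce.
// The values set here are the stock defaults, so this job behaves exactly like
// plain Hadoop; a vendor engine would replace them with its own plugin classes.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class PluggableSortHooks {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Map side: the component that collects, partitions and (by default) sorts map output.
    conf.set("mapreduce.job.map.output.collector.class",
             "org.apache.hadoop.mapred.MapTask$MapOutputBuffer");
    // Reduce side: the component that fetches and merges map output for the reducer.
    conf.set("mapreduce.job.reduce.shuffle.consumer.plugin.class",
             "org.apache.hadoop.mapreduce.task.reduce.Shuffle");
    Job job = Job.getInstance(conf, "pluggable-sort-hooks");
    // ...mapper, reducer, input and output paths would be configured as usual...
    System.out.println("Map output collector: "
        + job.getConfiguration().get("mapreduce.job.map.output.collector.class"));
  }
}
```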


Hadoop is significantly disrupting the cost structure of processing data at scale. However, deploying Hadoop is not free, and significant costs can add up. Vladimir Boroditsky, a director of software engineering at Google's Motorola Mobility Holdings Inc., recognized in a Wall Street Journal article that there is a very substantial cost to free software, noting that Hadoop comes with the additional costs of hiring in-house expertise and consultants. In all, the primary costs to consider for a complete enterprise data integration solution powered by Hadoop include: software, technical support, skills, hardware and time-to-value.

The first three factors, software, support, and skills, should be considered together. While the Hadoop software itself is open source and free, it is typically desirable to purchase a support subscription with an enterprise service level agreement (SLA). Likewise, it's important to consider the software and subscription costs as a whole when choosing the data integration tool that will work in tandem with Hadoop. In terms of skills, the Wall Street Journal reports that a Hadoop programmer, also sometimes referred to as a data scientist, can easily command at least $300,000 per year. Although a data integration tool may add costs on the software and support side, using the right tool can reduce the overall costs of development and maintenance by dramatically reducing the time to build and manage Hadoop jobs. Moreover, data integration tool skills are much more broadly available and much less expensive than specialized Hadoop MapReduce developer skills.

While Hadoop leverages commodity hardware, the associated costs can still be significant. When dealing with dozens of nodes over months and years, hardware costs add up, whether commodity or not. Therefore, it is still important to use hardware in the most efficient manner. Unfortunately, Hadoop's core MapReduce mechanics are inefficient with respect to processing data on each individual node. The strategy with Hadoop is to spread the processing and data across many nodes so that inefficiencies such as sorting are minimized. However, the inefficiencies are still there, and they add up as the number of nodes grows. Vertical scalability is critical to contain the costs associated with growing Hadoop clusters. Therefore, it's important to consider data integration tools that can complement Hadoop with the ability to maximize processing efficiency on each node, for example by enabling Hadoop to call more efficient sort algorithms and seamlessly optimize MapReduce operations.

Time-to-value is the elapsed time between starting to create and deploy jobs and the point at which an organization can start extracting value from Big Data. This dimension is another benefit of using a data integration tool with a graphical interface to speed development and maintenance. The time to create ETL jobs and deploy them into production is dramatically lower when using the right data integration tool than when using Hadoop utilities such as Pig, Hive, and Sqoop.

Hadoop provides virtually unlimited horizontal scalability. However, hardware and development costs can quickly hinder sustainable growth. Therefore, it's important to maximize developer productivity and per-node efficiency to contain costs:

- Choose cost-effective software and support, including both the Hadoop distribution and the data integration tool.
- Ensure tools include features to reduce the development and maintenance effort of MapReduce jobs.
- Look for optimizations that enhance Hadoop's vertical scalability to reduce hardware requirements.


DMX-h dramatically reduces the costs of leveraging Hadoop in a number of ways. First, DMX-h reduces time-to-value by making the development of Hadoop jobs much faster and easier than manual coding. With DMX-h, there is no need to hire additional programmers to implement Hadoop ETL; for the most part, you can leverage existing skills within the organization, or more easily find data integration tool developers at a more reasonable cost.

In terms of hardware, a rule-of-thumb cost for one Hadoop node is about $5,000. However, when adding the operating system (for example, a support subscription), cooling, maintenance, power, rack space, and so on, the total cost can grow to $12,000 per node, and that does not include administration costs. DMX-h enables Hadoop clusters to scale more efficiently and cost-effectively by maximizing the vertical scalability of each individual node. With more efficient hardware utilization, organizations can reduce capital and operational expenses by eliminating the need for additional compute nodes on the cluster.
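As a rough, purely illustrative back-of-the-envelope calculation (the cluster size and the number of avoided nodes below are hypothetical; the per-node figure is the rule-of-thumb number above), hardware cost scales linearly with node count, so every node that better per-node efficiency makes unnecessary is money saved:

```java
// Illustrative cluster hardware cost, using the rule-of-thumb figures above.
// The 50-node cluster and the 10 avoided nodes are hypothetical examples.
public class ClusterCostSketch {
  public static void main(String[] args) {
    int nodes = 50;
    int fullyLoadedPerNode = 12_000;  // hardware + OS subscription, cooling, power, rack space
    int avoidedNodes = 10;            // nodes not needed if per-node efficiency improves
    System.out.printf("Cluster hardware cost:   $%,d%n", nodes * fullyLoadedPerNode);
    System.out.printf("Saved by avoided nodes:  $%,d%n", avoidedNodes * fullyLoadedPerNode);
  }
}
```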


One of Hadoop's hallmark strengths is its ability to process massive data volumes of nearly any type. But that strength cannot be fully utilized unless the Hadoop cluster is adequately connected to all available data sources and targets, including relational databases, files, CRM systems, social media, mainframes and so on. However, moving data into and out of Hadoop is not trivial. Moreover, with the birth of new categories of data management technologies, broadly generalized as NoSQL and NewSQL, mission-critical systems like mainframes can all too often be neglected. The fact is that at least 70% of the world's transactional production applications run on mainframe platforms. The ability to process and analyze mainframe data with Hadoop could open up a wealth of opportunities by delivering deeper analytics, at lower cost, for many organizations.

Shortening the time it takes to get data into the Hadoop Distributed File System (HDFS) can be critical for many companies, such as those that must load billions of records each day. Reducing load times can also be important for organizations that plan to increase the amount and types of data they load into Hadoop as their applications or business grow. Finally, pre-processing data before loading it into Hadoop is vital in order to filter out the noise of irrelevant data, achieve significant storage space savings, and optimize performance.


Without the right connectivity, Hadoop risks becoming another data silo within the enterprise. Tools to get the needed data in and out of Hadoop at the right time are critical to maximize the value of Big Data:

- Select tools with a wide range of native connectors, particularly for popular relational databases, appliances, files and systems.
- Don't forget to include mainframe data in your Hadoop and Big Data strategies.
- Make sure connectivity is provided not only from a stand-alone data integration server to Hadoop, but also directly from the Hadoop cluster itself to a variety of sources and targets.
- Look for connectors that don't require writing additional code.
- Ensure high-performance connectivity for both loading and extracting data from various sources and targets.

DMX-h offers a range of high-performance connectors for every major RDBMS, appliances, XML, flat files, legacy sources and even mainframes. DMX-h writes data directly to HDFS using native Hadoop interfaces. DMX-h can partition the data and parallelize the loading process to load multiple streams simultaneously into HDFS, reducing the time to load data into HDFS by up to 6x.

Supported sources and targets include:

- File-based sources: flat files, mainframe, HDFS, legacy sources
- RDBMS: Oracle, DB2, SQL Server, Teradata, Sybase, ODBC
- Appliances: Netezza, Greenplum, Vertica
- Other: XML, MQ, Salesforce.com

DMX-h can also connect directly from each data node in the cluster to virtually any source and target, for even greater efficiency and faster data movement. Finally, Syncsort is commonly used to pre-process data prior to loading it into Hadoop. By first integrating and structuring the data with Syncsort before loading it to HDFS, load times are reduced, downstream MapReduce tasks execute faster and more efficiently, and storage requirements on the cluster are reduced.
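None of the following is DMX-h's internal code, but as a rough illustration of what "writing directly to HDFS through native Hadoop interfaces" and "pre-processing before load" mean at the API level, the sketch below filters a local pipe-delimited file and streams only the surviving records straight into an HDFS file via the FileSystem API. The namenode URI, paths, and the filter rule are all hypothetical.

```java
// Filter "noise" records locally, then stream the rest directly into HDFS.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FilteredHdfsLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical namenode address and target path.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
    try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
         PrintWriter out = new PrintWriter(new OutputStreamWriter(
             fs.create(new Path("/staging/filtered/part-00000"))))) {
      String line;
      while ((line = in.readLine()) != null) {
        // Drop records that lack the fields downstream jobs need,
        // so they never consume HDFS storage or MapReduce cycles.
        if (!line.trim().isEmpty() && line.split("\\|").length >= 5) {
          out.println(line);
        }
      }
    }
  }
}
```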


A leading global financial services organization with trillions of dollars in assets is looking to improve performance of its Hadoop ETL jobs.


As the de facto standard for Big Data processing and analytics, Hadoop represents a tremendous vehicle for extracting value from Big Data. However, relying only on Hadoop and common scripting tools like Pig, Hive and Sqoop to achieve a complete ETL solution can hinder the overall potential value of Big Data. Syncsort DMX-h provides a smarter approach, making Hadoop a more mature environment for enterprise ETL: development and maintenance are eased, overall costs are dramatically reduced, performance is multiplied, opportunities to leverage every data source are preserved, and time-to-value is minimized. As a high-performance leader in the data integration space, Syncsort has worked with early-adopter Hadoop customers to identify and solve the most common pitfalls organizations face. Regardless of the approach you take, it's important to recognize and address these pitfalls prior to deploying ETL on Hadoop:

#1 Hadoop is not a data integration tool
Select a data integration tool that can dramatically speed development and maintenance efforts by providing all the capabilities to make Hadoop ETL-ready, including connectivity, breadth of transformations and data processing functions, metadata, reusability and ease-of-use.

#2 MapReduce programmers are hard to find
Make sure your data integration tool includes specialized facilities to ease MapReduce job development. Also minimize the need to acquire MapReduce programming skills by selecting a tool that allows you to leverage the same data integration expertise your organization already has to develop MapReduce jobs without hand-coding.

#3 Most data integration tools don't run natively within Hadoop
Choose a data integration tool that runs natively within the Hadoop framework to minimize data movement and maximize data processing performance within each node. Avoid code generators altogether, as their code output frequently requires tedious tuning and maintenance.

#4 Hadoop may cost more than you think
Do not underestimate the cost of using Hadoop, including software, support, hardware, and skills. Choose a data integration tool that complements Hadoop's horizontal scalability with greater performance and efficiency on each node to minimize hardware costs.

#5 Elephants don't thrive in isolation
Unleash Hadoop's potential by making sure your data integration tool provides high-performance connectivity to move data into and out of Hadoop from virtually any system, particularly major relational databases, appliances, files and mainframes.


Additional resources:
- Simplifying and accelerating ETL use cases with Hadoop
- Hadoop MapReduce: To Sort or Not to Sort
- 2013: The Year Big Data Gets Bigger

Syncsort provides data-intensive organizations across the big data continuum with a smarter way to collect and process the ever-expanding data avalanche. With thousands of deployments across all major platforms, including mainframe, Syncsort helps customers around the world overcome the architectural limits of today's ETL and Hadoop environments, empowering their organizations to drive better business outcomes in less time, with fewer resources and lower TCO. For more information visit www.syncsort.com.

© 2013 Syncsort Incorporated. All rights reserved. DMExpress is a trademark of Syncsort Incorporated. All other company and product names used herein may be the trademarks of their respective companies.
