
Dell | Hadoop White Paper Series

Hadoop in
the Enterprise

A Dell Technical White Paper
By Joey Jablonski



Table of Contents
Introduction
Managing Hadoop as an island versus part of corporate IT
Top challenges and methods to overcome
Automated deployment
Configuration management
Monitoring and alerting
Hardware sizing
Hadoop node sizing
Hadoop cluster sizing parameters
Hadoop configuration parameters
Hadoop network design
Dell recommended network architecture
Hadoop security
Authentication
Authorization
Logging
Why Dell
About the author
Special thanks
About Dell Next Generation Computing Solutions
References
To learn more









This white paper is for informational purposes only, and may contain typographical errors and technical inaccuracies. The content is provided as is, without express
or implied warranties of any kind.

© 2011 Dell Inc. All rights reserved. Reproduction of this material in any manner whatsoever without the express written permission of Dell Inc. is strictly forbidden.
For more information, contact Dell. Dell, the Dell logo, the Dell badge, and PowerEdge are trademarks of Dell Inc.

Introduction
This is the second in a series of white papers from Dell about Hadoop. If you are new to Hadoop or Dell's Hadoop
solutions, Dell recommends reading the Introduction to Hadoop white paper before this paper.
Hadoop is becoming a critical part of many modern information technology (IT) departments. It is being used for a
growing range of requirements, including analytics, data storage, data processing, and shared compute resources. As
Hadoop's significance grows, it is important that it be treated, and managed, as a component of your larger IT organization. Hadoop is no longer relegated to research projects alone, and it should be managed as your company
would manage any other large component of its IT infrastructure.
Analytics is a growing need in many organizations. As the volumes of data from multiple sources continue to grow,
Hadoop has emerged as a leading enterprise tool for ingesting that data and providing the means to analyze it.
Data warehouses are a key component of many IT departments. They provide the business intelligence that many
companies use for decision making. Hadoop is beginning to augment these as a central location that feeds many different
business intelligence platforms within an organization. This means that Hadoop must be as available as, or more
available than, the tools it feeds and supports.
Hadoop has taken a standard path into most IT organizations, first being used for testing and development, then
migrating into production operations. Because of this, your IT department must ensure that the Hadoop environment is
properly planned and designed at initial deployment to support the more rigorous demands of a production environment.
This paper discusses the considerations for your IT department to ensure your Hadoop environment is able to grow and
change with your business needs, without requiring major refactoring work after the initial deployment due to suboptimal
solution architecture.
Managing Hadoop as an island versus part of corporate IT
Hadoop environments can contain many hundreds or possibly thousands of servers. This large number of devices can
become a management burden for your IT department. As your environment grows, your IT administrators can be left
struggling with complexity.
Maximum attention should be paid during your initial Hadoop deployment to ensure it does not become an island that IT
must manage outside of standard tools and processes. A Hadoop environment, optimally, should utilize external company
shared resources for authentication, monitoring, backup, alerting, and processes. This integration, early and often, will
ensure that the Hadoop environment does not consume an unnecessary amount of time from the IT department relative
to other applications within the corporation.
The design of a Hadoop solution should be optimized for the performance needs and usage model of the intended
Hadoop environment. That is not to say it should completely contradict other solutions in the environment. The Hadoop
environment should share IT best practices and processes with other solutions in the enterprise to ensure consistency in
deployment and operations. This consistency can come from common hardware, common software, or both across IT
environments. Common hardware eases servicing, while common software ensures that tools, scripts, and
processes do not require major modification to support the Hadoop environment.
Hadoop deployment can be a time-consuming process. The number of systems involved in a Hadoop environment can
easily overwhelm the most experienced of system administrators. It is important that an automated solution be utilized for
deploying the Hadoop environment, both to save time and ensure consistency. If your enterprise has an existing
operating system (OS) deployment strategy, it should be evaluated to determine if it will be viable for Hadoop
deployment; if not, a vendor deployment strategy for the Hadoop environment should be considered.
Operating most IT environments costs more over time than the initial purchase and deployment. Processes for IT
operations, support, and escalations should be updated to accommodate the Hadoop environment; you do not want to
start over and create a parallel set of processes and structures. Like any IT environment, Hadoop will require regular
monitoring and user support. Documentation should be updated to accommodate the differences in the Hadoop
environment, and staff should be trained to ensure long-term support of the Hadoop environment.
Top challenges and methods to overcome
Automated deployment
Automated deployment of both operating systems and the Hadoop software ensures consistent, streamlined
deployments. Dell provides proven configurations that are documented and simpler to deploy than traditional manual IT
deployment strategies. Dell augments these proven solutions with services and tools to streamline solution deployment,
testing, and field validation.
Configuration management
Dell recommends the use of a configuration management tool for all Hadoop environments. Dell | Hadoop solutions
include Chef for this purpose. Chef is used for deploying configuration changes, managing the installation of the Hadoop
software, and providing a single interface to the Hadoop environment for updates and configuration changes.
Monitoring and alerting
Hardware monitoring and alerting is an important part of all dynamic IT environments. Successful monitoring and alerting
ensures that problems are caught as soon as possible and administrators are alerted so the problems can be corrected
before users are impacted. Dell provides support for integration with Nagios and Ganglia as part of our Hadoop solution
stack for monitoring the software and hardware environment.
Dell also recommends integrating the Hadoop environment into any existing enterprise monitoring and management
packages. Dell | Hadoop solutions support standard interfaces, including Simple Network Management Protocol (SNMP)
and Intelligent Platform Management Interface (IPMI) for integration with third-party management and operations tools.
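In addition to hardware-level interfaces such as SNMP and IPMI, the Hadoop daemons themselves expose metrics over HTTP that can be folded into an existing monitoring package. The following is a minimal sketch, assuming the NameNode web interface is reachable on its default port and that the bean and attribute names (FSNamesystemState, NumLiveDataNodes, NumDeadDataNodes) match your Hadoop version; the host name and the alerting policy are placeholders, not part of the Dell solution.

    import json
    import sys
    from urllib.request import urlopen

    # Placeholder values; adjust for your environment.
    NAMENODE_JMX_URL = "http://namenode.example.com:50070/jmx"
    BEAN_QUERY = "Hadoop:service=NameNode,name=FSNamesystemState"  # attribute names vary by version

    def check_datanodes():
        """Poll the NameNode JMX servlet and report live/dead DataNode counts."""
        with urlopen(NAMENODE_JMX_URL + "?qry=" + BEAN_QUERY, timeout=10) as resp:
            beans = json.load(resp).get("beans", [])
        if not beans:
            print("CRITICAL: no FSNamesystemState bean returned")
            return 2
        state = beans[0]
        live = state.get("NumLiveDataNodes", 0)
        dead = state.get("NumDeadDataNodes", 0)
        print("DataNodes live=%d dead=%d" % (live, dead))
        return 2 if dead > 0 else 0          # Nagios-style exit codes: 0 = OK, 2 = CRITICAL

    if __name__ == "__main__":
        sys.exit(check_datanodes())

A check like this can be wrapped as a Nagios plugin or fed into Ganglia alongside the hardware metrics gathered over SNMP and IPMI.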
Hardware sizing
Sizing Hadoop environments is often an error-prone, trial-and-error process. With Dell | Hadoop solutions, Dell provides tested,
known configurations to streamline the sizing process and deliver sizing guidance based on known, real-world
workloads.
There are two aspects to sizing within a Hadoop environment:
1. Individual node size: This is the hardware configuration of each individual node.
2. Hadoop cluster size: This is the size of the entire Hadoop environment, including the number of nodes and the
interconnects between them.
Hadoop node sizing
The four primary sizing considerations for Hadoop nodes are physical memory, disk capacity, network bandwidth, and
CPU speed and core count. A properly balanced Hadoop node contains enough capacity in each category so that a
shortage in one does not become a bottleneck that negatively impacts overall performance.
Simpler is better when sizing any Hadoop environment. Feedback from seasoned Hadoop users consistently carries the
same recommendation: targeting a middle-ground configuration, with a good balance among the key components of the
individual servers, is better than trying to optimize the hardware design for the expected workload. First, workloads
change much more often than the hardware is replaced. Second, the mix of user workloads will also change over time,
and a balanced configuration, while not optimized for a specific use, will not penalize any one job type over another.
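The notion of a balanced node can be made concrete with a few rough ratios. The sketch below compares aggregate disk throughput against network bandwidth and reports memory and spindles per core for a candidate configuration; the input values and the rule-of-thumb targets in the comments are illustrative assumptions for this example, not Dell sizing guidance.

    # Hypothetical candidate node; substitute the configuration you are evaluating.
    cores = 12
    memory_gb = 48
    data_disks = 12
    disk_mb_per_sec = 100          # sustained sequential throughput per disk
    nic_gbits = 2                  # e.g., two bonded Gigabit Ethernet links

    # Simple balance indicators. Commonly cited (illustrative) targets are roughly one
    # spindle and a few GB of memory per core, with aggregate disk throughput
    # comfortably above what the network can deliver to the node.
    disk_throughput_gbits = data_disks * disk_mb_per_sec * 8 / 1000.0
    print("Memory per core:   %.1f GB" % (memory_gb / cores))
    print("Spindles per core: %.1f" % (data_disks / cores))
    print("Aggregate disk throughput: %.1f Gbit/s vs network %.1f Gbit/s"
          % (disk_throughput_gbits, nic_gbits))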
Hadoop cluster sizing parameters
Sizing a Hadoop cluster is a different activity from sizing the individual nodes, and should be planned for accordingly.
There are separate parameters that must be considered when sizing the entire cluster; these include number of jobs to be
run, data volume, expected growth, and number of users. Considerations should also be taken for the amount of data to
be replicated, the number of data replicas, and the availability needs of the Hadoop cluster.
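As a simple illustration of how these parameters interact, the sketch below estimates the number of DataNodes needed to hold a given data volume. The input values (data volume, growth rate, replication factor, per-node raw capacity, and the fraction reserved for intermediate data) are hypothetical and should be replaced with figures from your own workload analysis.

    import math

    # Hypothetical sizing inputs; substitute your own estimates.
    initial_data_tb = 100.0       # data to be loaded at deployment
    annual_growth = 0.40          # 40 percent expected yearly growth
    planning_years = 2            # size the cluster for two years of growth
    replication_factor = 3        # HDFS defaults to three replicas
    temp_space_fraction = 0.25    # raw capacity reserved for temporary and intermediate data
    raw_tb_per_node = 12.0        # e.g., twelve 1 TB data disks per DataNode

    # Project the data volume forward, then account for replication and reserved space.
    projected_tb = initial_data_tb * (1 + annual_growth) ** planning_years
    raw_tb_needed = projected_tb * replication_factor / (1 - temp_space_fraction)
    nodes_needed = math.ceil(raw_tb_needed / raw_tb_per_node)

    print("Projected data: %.1f TB" % projected_tb)
    print("Raw HDFS capacity needed: %.1f TB" % raw_tb_needed)
    print("DataNodes needed: %d" % nodes_needed)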
Hadoop contains more than 200 tunable parameters, many of which will not influence your specific job. Dell | Hadoop
solutions provide recommended configuration parameters for each of the three use cases of Hadoop: Compute, Storage,
and Database. Dell has validated these parameters for documented workloads, streamlining system bring-up and tuning
for new Hadoop environments.

Hadoop configuration parameters
Hadoop tuning and optimization is a never-ending process. This recurring process is depicted in Figure 1. Each cycle
begins with "Determine parameter to change"; the process then proceeds clockwise and repeats. All Hadoop
environments will need to be reviewed for tuning and performance optimization whenever the mix of jobs and workload
characteristics changes.

Figure 1. The ongoing Hadoop tuning and optimization process.
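Parts of the loop in Figure 1 can be automated. The following is a minimal, hypothetical harness that re-runs a representative job while varying a single parameter and records the elapsed time for each value. The jar path, the parameter being swept (mapred.reduce.tasks, a pre-YARN name), and the assumption that benchmark input already exists in HDFS are all illustrative; substitute your own workload and the parameter under study.

    import subprocess
    import time

    # Illustrative benchmark command; replace with a representative job from your own workload.
    JOB_TEMPLATE = ("hadoop jar /usr/lib/hadoop/hadoop-examples.jar terasort "
                    "-D mapred.reduce.tasks={value} /bench/input /bench/output-{value}")

    def run_once(value):
        """Run the benchmark job with one parameter value and return the elapsed seconds."""
        start = time.time()
        subprocess.check_call(JOB_TEMPLATE.format(value=value), shell=True)
        return time.time() - start

    if __name__ == "__main__":
        results = {}
        for reducers in (8, 16, 32, 64):       # candidate values for the parameter under test
            results[reducers] = run_once(reducers)
        for reducers, seconds in sorted(results.items()):
            print("mapred.reduce.tasks=%d -> %.0f s" % (reducers, seconds))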
Hadoop network design
The network design is a key component of a Hadoop environment, and it has important implications for both the expected
usage of the environment and its scalability. The network design should factor in the expected
workloads, provide for adequate growth of the environment without major rework, and support monitoring of the
network to alert administration staff to network congestion.
Hadoop isolation is a primary design requirement for Hadoop clusters. The network that the Hadoop nodes use for node-
to-node communication should be isolated from the other corporate networks. This ensures maximum bandwidth for
the Hadoop environment and minimizes the impact of other network operations on Hadoop.
One important consideration of the network design for Hadoop solutions is the switch capabilities to accommodate
high-packet-count communication patterns. Hadoop can create large volumes of packets on the network during rebuild
operations as well as during high Hadoop Distributed File System (HDFS) I/O activity. The network architecture should
ensure the switches have the capability to handle this traffic pattern without additional network latency or dropped
packets. The Dell white paper Introduction to Hadoop covers our network architecture in greater detail.
Like many modern applications, Hadoop utilizes IP for all communications. This means Hadoop has the same restrictions
and design requirements you would see for any other network regarding broadcast domains, network separation, and
routing and switching considerations. Dell recommends that Hadoop clusters utilize approximately 60 nodes within a
single switched environment. For clusters larger than that, Dell recommends utilizing a layer 3 network device to segment
the network and maintain adequately sized broadcast domains.
Today, most Hadoop solutions are built utilizing Gigabit Ethernet. Many DataNodes will utilize two Gigabit Ethernet
connections in an aggregated link to the switch. This provides additional bandwidth for the node. Some users are
beginning to look at 10 Gigabit Ethernet as the price continues to come down. The majority of the users today do not
require the additional performance of 10 Gigabit Ethernet, and can save some cost in their Hadoop environments by
utilizing Gigabit Ethernet.
Availability of the network infrastructure is a primary concern if the Hadoop environment is critical to business operations
or functions. The network should include the appropriate level of redundancy to ensure business functions will continue if
components within the network fail. Hadoop has facilities within the software for handling failures of the hardware and
network; these should be accounted for when determining which tiers of the network will contain redundant
components and which will not. You can gain significant efficiency by leveraging Hadoop's ability to replicate
information to separate racks, or to servers on alternate switches, allowing access in the event of a network failure.
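Hadoop's rack-aware replica placement relies on an administrator-supplied topology script (configured through the topology.script.file.name property in Hadoop 1.x-era releases) that maps each DataNode address to a rack path. The sketch below is a minimal example that assumes, purely for illustration, that the third octet of a node's IP address identifies its rack; real deployments more commonly look hostnames up in a maintained mapping file.

    #!/usr/bin/env python
    # Minimal Hadoop topology script: prints one rack path per address argument.
    # Assumes (hypothetically) that the third octet of the IP address identifies the rack.
    import sys

    DEFAULT_RACK = "/default-rack"

    def rack_for(address):
        parts = address.split(".")
        if len(parts) == 4 and all(p.isdigit() for p in parts):
            return "/dc1/rack-%s" % parts[2]
        return DEFAULT_RACK

    if __name__ == "__main__":
        # Hadoop invokes the script with one or more IPs or hostnames and reads
        # the corresponding rack paths from standard output.
        print(" ".join(rack_for(a) for a in sys.argv[1:]))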
Dell recommended network architecture
Figure 2 shows the recommended network architecture for the top-of-rack (ToR) switches within a Hadoop environment.
These connect directly to the DataNodes and allow for all inter-node communication within the Hadoop environment.
The standard configuration Dell recommends is six 48-port Gigabit Ethernet switches with stacking capabilities
(for example, the Dell PowerConnect 6248). These six switches are stacked and act as a single switch for management purposes.
This network configuration will support up to 60 DataNodes for Hadoop; additional nodes can easily be added by utilizing
an end-of-row (EoR) switch as described in the next section.

Figure 2. Recommended network architecture for the top-of-rack switches within a Hadoop environment.

Hadoop is a highly scalable software platform that requires a network designed with the same scalability in mind. To
ensure maximum scalability, Dell recommends a network architecture that allows you to start with small Hadoop
configurations and grow those over time by adding components, without requiring a rework of the existing environment.
To meet that design goal, Dell recommends the use of two switches acting as EoR devices. These will connect to the ToR
switches, as shown in Figure 3, but add routing and advanced functionality to scale above 60 DataNodes. These two EoR
switches will allow a maximum of 720 DataNodes within the Hadoop environment before an additional layer of network
connectivity is needed or larger switches are required for EoR connectivity.


Figure 3. Two switches acting as EoR devices and connecting to ToR switches.
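As a quick sanity check on this scaling limit, the short sketch below works through the arithmetic, assuming roughly 60 DataNodes per stacked ToR group as described above; the number of ToR groups the EoR pair can serve is an illustrative assumption driven by the uplink ports available on the EoR switches.

    # Illustrative scaling arithmetic for the recommended architecture.
    nodes_per_tor_stack = 60      # DataNodes served by one stacked group of ToR switches
    tor_stacks_per_eor_pair = 12  # assumed number of ToR stacks the EoR pair can uplink

    max_nodes = nodes_per_tor_stack * tor_stacks_per_eor_pair
    print("Maximum DataNodes behind one EoR pair: %d" % max_nodes)  # 720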
Hadoop security
Authentication
Hadoop supports a variety of authentication mechanisms, including Kerberos, Active Directory, and Lightweight Directory
Access Protocol. These all enable a list of authorized users and their credentials to be stored centrally and validated
by the Hadoop environment. Dell recommends utilizing an existing companywide authentication scheme for
Hadoop, eliminating the need to support a separate authentication system that serves only Hadoop.
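When Kerberos is the central authentication mechanism, a user or service account typically obtains a ticket from the corporate KDC before issuing Hadoop commands. The following is a minimal sketch using the standard MIT Kerberos and Hadoop command-line tools; the principal, keytab path, and HDFS path are hypothetical placeholders.

    import subprocess

    # Hypothetical principal and keytab, issued by the corporate KDC rather than by Hadoop itself.
    PRINCIPAL = "etl-batch@CORP.EXAMPLE.COM"
    KEYTAB = "/etc/security/keytabs/etl-batch.keytab"

    def run_as_principal():
        """Obtain a Kerberos ticket, then issue a Hadoop command authenticated with it."""
        subprocess.check_call(["kinit", "-kt", KEYTAB, PRINCIPAL])
        subprocess.check_call(["hadoop", "fs", "-ls", "/user/etl-batch"])

    if __name__ == "__main__":
        run_as_principal()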
Authorization
Authorization is an additional layer on top of the authentication that must occur for users. Authentication verifies that the
user credentials are correct; in essence, that the user name and password are valid, active, and usable for some period of
time. Authorization builds on that validation to determine whether the user is allowed to perform the requested action. Authorization
is commonly implemented as file permissions in a file system and access controls within a relational database
management system (RDBMS).
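In HDFS this typically takes the form of POSIX-style owner, group, and mode bits on files and directories. The commands below are a brief, hypothetical illustration using the standard Hadoop file system shell; the path, accounts, group, and mode are placeholders, and in practice such settings would usually be applied by a deployment or configuration management tool.

    import subprocess

    # Hypothetical layout: members of the analytics group may read curated data,
    # but only the etl service account may modify it.
    commands = [
        ["hadoop", "fs", "-chown", "-R", "etl:analytics", "/data/curated"],
        ["hadoop", "fs", "-chmod", "-R", "750", "/data/curated"],
    ]

    for cmd in commands:
        subprocess.check_call(cmd)  # fail fast if a command returns a non-zero exit status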
By utilizing centralized authentication and authorization for Hadoop and other corporate services, security models can be
developed between the environments to ensure permissions are properly mapped across environments. Hadoop is
commonly used as an intermediary point for data processing and storage; it should enforce all corporate security policies
for data that passes through Hadoop and is processed by Hadoop.
Logging
Logging is a critical part of both system operation and security. The logs for Hadoop and the underlying systems provide
insight into unexpected access that could lead to a system or data compromise as well as system stability issues. Hadoop
environments should utilize a central logging facility for correlation of logs, recovery of logs from failed hosts, and
alerting on unexpected events and anomalies.
In development environments, and as your environment grows, it is beneficial to compare logs from
Hadoop with logs from the applications that use Hadoop. This capability enables your administrators to correlate
problems across the entire stack of components that make up a functioning application.
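Many sites accomplish this by forwarding daemon logs to a central syslog or log-aggregation host. The sketch below is a simple, hypothetical forwarder that tails one Hadoop log file and relays new lines to a central syslog collector using the Python standard library; the log path and collector host are placeholders, and production deployments would normally use whatever log-shipping infrastructure the enterprise has already standardized on.

    import logging
    import logging.handlers
    import time

    # Placeholders: the local Hadoop log to watch and the central syslog collector.
    LOG_FILE = "/var/log/hadoop/hadoop-hdfs-datanode.log"
    COLLECTOR = ("loghost.example.com", 514)

    logger = logging.getLogger("hadoop-forwarder")
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.SysLogHandler(address=COLLECTOR))

    def follow(path):
        """Yield new lines appended to a file, similar to 'tail -f'."""
        with open(path) as handle:
            handle.seek(0, 2)                 # start at the current end of the file
            while True:
                line = handle.readline()
                if line:
                    yield line.rstrip()
                else:
                    time.sleep(1.0)

    if __name__ == "__main__":
        for entry in follow(LOG_FILE):
            logger.info(entry)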
Why Dell
Dell has worked with Cloudera to design, test, and support an integrated solution of hardware and software for
implementing the Hadoop ecosystem. This solution has been engineered and validated to work together and provide
known performance parameters and deployment methods. Dell recommends that you utilize known hardware and
software solutions when deploying Hadoop to ensure low-risk deployments and minimal compatibility issues. Dell's
solution ensures that you get maximum performance with minimal testing prior to purchase.
Dell recommends that you purchase and maintain support on the entire ecosystem of your Hadoop solution. Today's
solutions are complex combinations of components that require upgrades as new software becomes available and
assistance when staff is working on new parts of the solutions. The Dell | Cloudera Hadoop solution provides a full line of
support, including hardware and software, so you always have a primary contact for assistance to ensure maximum
availability and stability of your Hadoop environment.
About the author
Joey Jablonski is a principal solution architect with Dell's Data Center Solutions team. Joey works to define and
implement Dell's solutions for big data, including solutions based on Apache Hadoop. Joey has spent more than 10 years
working in high-performance computing, with an emphasis on interconnects, including InfiniBand, and parallel file
systems. Joey has led technical solution design and implementation at Sun Microsystems and Hewlett-Packard, as well as
consulted for customers, including Sandia National Laboratories, BP, ExxonMobil, E*Trade, Juelich Supercomputing
Centre, and Clumeq.
Special thanks
The author extends special thanks to:
Aurelian Dumitru, Senior Cloud Solutions Architect, Dell
Rebecca Brenton, Cloud Software Alliances, Dell
Scott Jensen, Director, Cloud Solutions Software Engineering, Dell

About Dell Next Generation Computing Solutions
When cloud computing is the core of your business and its efficiency and vitality underpin your success, the Dell Next
Generation Computing Solutions are Dell's response to your unique needs. We understand your challenges, from
compute and power density to global scaling and environmental impact. Dell has the knowledge and expertise to tune
your company's factory for maximum performance and efficiency.
Dell Next Generation Computing Solutions provide operational models backed by unique product solutions to meet the
needs of companies at all stages of their lifecycles. Solutions are designed to meet the needs of small startups while
allowing scalability as your company grows.
Deployment and support are tailored to your unique operational requirements. Dell Cloud Computing Solutions can help
you minimize the tangible operating costs that have hyperscale impact on your business results.
References
Chef
http://wiki.opscode.com/display/chef/Home
SNMP
http://www.net-snmp.org/
IPMI
http://www.intel.com/design/servers/ipmi/ipmi.htm
Cloudera Hadoop
http://www.cloudera.com/products-services/enterprise/



To learn more
To learn more about Dell cloud solutions, contact your Dell representative
or visit:

www.dell.com/cloud




© 2011 Dell Inc. All rights reserved. Dell, the DELL logo, the DELL badge, and PowerConnect are trademarks of Dell Inc. Other trademarks and trade names may be used in this document to
refer to either the entities claiming the marks and names or their products. Dell disclaims proprietary interest in the marks and names of others. This document is for informational purposes
only. Dell reserves the right to make changes without further notice to the products herein. The content provided is as-is and without express or implied warranties of any kind.
