
Contents

1. Cloud Basics
2. Virtual Machine Basics, Installation
3. Hadoop Basics
4. Hadoop Single Node Cluster Installation
5. Word Count on your Local Computer
6. Word Count using MapReduce
7. Word Count using Hive

Cloud: Simply put, cloud computing is the delivery of computing services—servers, storage,
databases, networking, software, analytics, intelligence and more—over the Internet (“the
cloud”) to offer faster innovation, flexible resources and economies of scale.
Benefits of Cloud: Cloud computing is a big shift from the traditional way businesses think about
IT resources. Here are six common reasons organisations are turning to cloud computing
services:
Cost: Cloud computing eliminates the capital expense of buying hardware and software and
setting up and running on-site datacenters—the racks of servers, the round-the-clock electricity
for power and cooling, the IT experts for managing the infrastructure. It adds up fast.
Speed: Most cloud computing services are provided self-service and on demand, so even vast
amounts of computing resources can be provisioned in minutes, typically with just a few mouse
clicks, giving businesses a lot of flexibility and taking the pressure off capacity planning.
Global scale: The benefits of cloud computing services include the ability to scale elastically. In
cloud speak, that means delivering the right amount of IT resources—for example, more or less
computing power, storage, bandwidth—right when it is needed and from the right geographic
location.
Productivity: On-site datacenters typically require a lot of “racking and stacking”—hardware
set up, software patching and other time-consuming IT management chores. Cloud computing
removes the need for many of these tasks, so IT teams can spend time on achieving more
important business goals.
Performance: The biggest cloud computing services run on a worldwide network of secure
datacenters, which are regularly upgraded to the latest generation of fast and efficient computing
hardware. This offers several benefits over a single corporate datacenter, including reduced
network latency for applications and greater economies of scale.
Security: Many cloud providers offer a broad set of policies, technologies and controls that
strengthen your security posture overall, helping protect your data, apps and infrastructure from
potential threats.
Types of cloud computing
Not all clouds are the same and not one type of cloud computing is right for everyone. Several
different models, types and services have evolved to help offer the right solution for your needs.
Types of cloud deployments: public, private and hybrid
First, you need to determine the type of cloud deployment, or cloud computing architecture, that
your cloud services will be implemented on. There are three different ways to deploy cloud
services: on a public cloud, private cloud or hybrid cloud.
Public cloud: Public clouds are owned and operated by third-party cloud service providers,
which deliver their computing resources, like servers and storage, over the Internet. Microsoft
Azure is an example of a public cloud. With a public cloud, all hardware, software and other
supporting infrastructure is owned and managed by the cloud provider. You access these services
and manage your account using a web browser.
Private cloud: A private cloud refers to cloud computing resources used exclusively by a single
business or organisation. A private cloud can be physically located on the company’s on-site
datacenter. Some companies also pay third-party service providers to host their private cloud. A
private cloud is one in which the services and infrastructure are maintained on a private network.
Hybrid cloud: Hybrid clouds combine public and private clouds, bound together by technology
that allows data and applications to be shared between them. By allowing data and applications
to move between private and public clouds, a hybrid cloud gives your business greater flexibility,
more deployment options and helps optimise your existing infrastructure, security and
compliance.
Types of cloud services: IaaS, PaaS, serverless and SaaS
Most cloud computing services fall into four broad categories: infrastructure as a service (IaaS),
platform as a service (PaaS), serverless and software as a service (SaaS). These are sometimes
called the cloud computing stack because they build on top of one another. Knowing what they
are and how they are different makes it easier to accomplish your business goals.

Infrastructure as a service (IaaS): The most basic category of cloud computing services. With
IaaS, you rent IT infrastructure—servers and virtual machines (VMs), storage, networks,
operating systems—from a cloud provider on a pay-as-you-go basis.
Platform as a service (PaaS): Platform as a service refers to cloud computing services that
supply an on-demand environment for developing, testing, delivering and managing software
applications. PaaS is designed to make it easier for developers to quickly create web or mobile
apps, without worrying about setting up or managing the underlying infrastructure of servers,
storage, network and databases needed for development.
Serverless computing: Overlapping with PaaS, serverless computing focuses on
building app functionality without spending time continually managing the servers
and infrastructure required to do so. The cloud provider handles the setup, capacity
planning and server management for you. Serverless architectures are highly scalable
and event-driven, only using resources when a specific function or trigger occurs.
Software as a service (SaaS): Software as a service is a method for delivering software
applications over the Internet, on demand and typically on a subscription basis. With
SaaS, cloud providers host and manage the software application and underlying
infrastructure and handle any maintenance, like software upgrades and security
patching. Users connect to the application over the Internet, usually with a web
browser on their phone, tablet or PC.
Uses of cloud computing
You are probably using cloud computing right now, even if you don’t realise it. If you use an
online service to send email, edit documents, watch movies or TV, listen to music, play games or
store pictures and other files, it is likely that cloud computing is making it all possible behind the
scenes. The first cloud computing services are barely a decade old, but already a variety of
organisations—from tiny startups to global corporations, government agencies to non-profits—
are embracing the technology for all sorts of reasons.
Here are a few examples of what is possible today with cloud services from a cloud provider:

Create new apps and services


Quickly build, deploy and scale applications—web, mobile and API—on any
platform. Access the resources you need to help meet performance, security and
compliance requirements.
Test and build applications
Reduce application development cost and time by using cloud infrastructures that can easily be
scaled up or down.

Store, back up and recover data


Protect your data more cost-efficiently—and at massive scale—by transferring your data over the
Internet to an offsite cloud storage system that is accessible from any location and any device.

Analyse data
Unify your data across teams, divisions and locations in the cloud. Then use cloud services, such
as machine learning and artificial intelligence, to uncover insights for more informed decisions.

Stream audio and video


Connect with your audience anywhere, anytime, on any device with high-definition video and
audio with global distribution.
Embed intelligence
Use intelligent models to help engage customers and provide valuable insights from the data
captured.

Deliver software on demand


Also known as software as a service (SaaS), on-demand software lets you offer the latest
software versions and updates to customers—anytime they need them, anywhere they are.

Questions for you..


Give a few examples of clouds and cloud services
What cloud services do you use?
Is the cloud hardware, software, or a combination of both?

2. Virtual Machine
A virtual machine is a computer file, typically called an image, which behaves like an actual
computer; in other words, it is a computer within a computer. It runs in a window, much like
any other programme, giving the end user the same experience on a virtual machine as they
would have on the host operating system itself. The virtual machine is sandboxed from the rest
of the system, meaning that the software inside a virtual machine cannot escape or tamper with
the computer itself. This produces an ideal environment for testing other operating systems
including beta releases, accessing virus-infected data, creating operating system backups and
running software or applications on operating systems for which they were not originally
intended.
Multiple virtual machines can run simultaneously on the same physical computer. For servers,
the multiple operating systems run side-by-side with a piece of software called a hypervisor to
manage them, while desktop computers typically employ one operating system to run the other
operating systems within its programme windows. Each virtual machine provides its own virtual
hardware, including CPUs, memory, hard drives, network interfaces and other devices. The
virtual hardware is then mapped to the real hardware on the physical machine. This saves costs
by reducing the need for physical hardware systems and their associated maintenance, and it
also reduces power and cooling demand.

Installation
1. VirtualBox is a free program provided by Oracle. This is the software that powers the entire
virtualization process. You can use any similar virtualization software that enables the creation
of virtual machines.
2. Download the ISO file of the OS you wish to install.
3. Now that you have VirtualBox installed and OS downloaded, open VirtualBox, click New,
and use the following steps as a guide:
1. Name and operating system. Give the VM a name, choose Linux from
the Type dropdown, and select the Linux version as indicated. Go with Other Linux if
your distribution isn’t listed.
2. Memory size. Select the memory size. This will siphon RAM from your system for the
VM, so don’t overdo it. This number can be changed easily later under the VM settings
menu.
3. Hard drive. Since we’re starting fresh, leave it on the default Create a virtual hard
drive now.
4. Hard drive file type. There are multiple choices here for advanced users.
Choose VDI for now unless you know you will need one of the other options.
5. Storage on physical hard drive. Based on your decision earlier, select
either Dynamically allocated for an expandable drive file or Fixed size for a static drive
file. As the description states, please note that a dynamic drive file will expand as needed,
but will not shrink again automatically when space on it is freed.
6. File location and size. Since you’re making a virtual hard drive within your existing file
space, you can give this file a name and choose where it is stored. Adjust the slider or
type in a specific number in the box to the right to specify the virtual hard drive size.

In VirtualBox, you should now see that there is an item on the left-hand side listing your name
from step 1 followed by Powered Off. Let’s go ahead and get it powered on.

Select your new VM on the left and click Start at the top.

The next window will prompt you to select a start-up disk for the VM. Click on the icon next to
the dropdown to open a file explorer window, then track down the Linux ISO you downloaded
earlier.

Once you have selected the file, click Start. This will take you to the boot-up options for the OS,
and from here you can simply follow the prompts. Many Linux distributions will give you the
option to try the software or outright install it – keep in mind that you have set up a virtual disk
drive for this VM and you will not be affecting your files on the host machine by installing the
new OS.

Questions for you..


1. Give examples of virtual machine software.
2. How many virtual machines can run on your system at one time?
3. Can you install a Linux virtual machine on a Windows system, or vice versa?
4. Can you install a higher version of an OS in your VM than your base OS?
5. In how many ways can you install an OS while creating your VM?
3. Hadoop Basics
Please follow the below three links:
https://www.youtube.com/watch?v=AZovvBgRLIY
https://hadoop.apache.org/old/
http://hadoop.apache.org/
and the book Managing Big Data Work Flows for Dummies attached in Google Classroom
Big data is becoming a catchall phrase, while Hadoop refers to a specific technology framework.
Hadoop is a gateway that makes it possible to work with big data, or more specifically, large data
sets that reside in a distributed environment. One way to define big data is data that is too big to
be processed by relational database management systems (RDBMS). Hadoop helps overcome
RDBMS limitations, so big data can be processed.
Examples of Hadoop
Here are five examples of Hadoop use cases:

1. Financial services companies use analytics to assess risk, build investment models, and create
trading algorithms; Hadoop has been used to help build and run those applications.

2. Retailers use it to help analyze structured and unstructured data to better understand and serve
their customers.

3. In the asset-intensive energy industry, Hadoop-powered analytics are used for predictive
maintenance, with input from Internet of Things (IoT) devices feeding data into big data
programs.

4. Telecommunications companies can adapt all the aforementioned use cases. For example, they
can use Hadoop-powered analytics to execute predictive maintenance on their infrastructure. Big
data analytics can also plan efficient network paths and recommend optimal locations for new
cell towers or other network expansion. To support customer-facing operations telcos can
analyze customer behavior and billing statements to inform new service offerings.

5. There are numerous public sector programs, ranging from anticipating and preventing disease
outbreaks to crunching numbers to catch tax cheats.

Hadoop Tutorial: How It All Started?


Before getting into technicalities in this Hadoop tutorial blog, let me begin with an interesting
story about how Hadoop came into existence and why it is so popular in the industry nowadays. So,
it all started with two people, Mike Cafarella and Doug Cutting, who were in the process of
building a search engine system that could index 1 billion pages. After their research, they
estimated that such a system would cost around half a million dollars in hardware, with a monthly
running cost of $30,000, which is quite expensive. However, they soon realized that their
architecture would not be capable of handling billions of pages on the web.
They came across a paper, published in 2003, that described the architecture of Google’s
distributed file system, called GFS, which was being used in production at Google. Now, this
paper on GFS proved to be something that they were looking for, and soon, they realized that it
would solve all their problems of storing very large files that are generated as a part of the web
crawl and indexing process. Later in 2004, Google published one more paper that introduced
MapReduce to the world. Finally, these two papers led to the foundation of the framework
called “Hadoop”. Doug commented on Google’s contribution to the development of the Hadoop
framework:

“Google is living a few years in the future and sending the rest of us messages.”

Hadoop Tutorial: What is Big Data?


Have you ever wondered how technologies evolve to fulfill emerging needs? For example,
earlier we had landline phones, but now we have shifted to smartphones. Similarly, how many of
you remember the floppy drives that were extensively used back in the '90s? Floppy drives have
been replaced by hard disks because they had very low storage capacity and transfer speed,
which makes them insufficient for handling the amount of data we deal with today. In fact, now
we can store terabytes of data in the cloud without being bothered about size constraints.

Now, let us talk about various drivers that contribute to the generation of data.

Have you heard about IoT? IoT connects your physical devices to the internet and makes them
smarter. Nowadays, we have smart air conditioners, televisions, etc. A smart air conditioner
constantly monitors your room temperature along with the outside temperature and accordingly
decides what the temperature of the room should be. Now imagine how much data would be
generated in a year by smart air conditioners installed in tens of thousands of houses.

Now, let us talk about the largest contributor to Big Data: social media. Social media is one of
the most important factors in the evolution of Big Data, as it provides information about
people’s behavior. You can look at the figure below to get an idea of how much data is
generated every minute.
Fig: Hadoop Tutorial – Social Media Data Generation Stats

Apart from the rate at which the data is getting generated, the second factor is the lack of proper
format or structure in these data sets that makes processing a challenge.

Hadoop Use Case 1: Big Data & Hadoop – Restaurant Analogy


Let us take an analogy of a restaurant to understand the problems associated with Big Data and
how Hadoop solved that problem.

Bob is a businessman who has opened a small restaurant. Initially, his restaurant received two
orders per hour, and he had one chef with one food shelf, which was sufficient to handle all the
orders.

Fig: Hadoop Tutorial – Traditional Restaurant Scenario

Now let us compare the restaurant example with the traditional scenario, where data was
generated at a steady rate and traditional systems like an RDBMS were capable of handling it,
just like Bob’s chef. Here, you can relate the data storage to the restaurant’s food shelf and
the traditional processing unit to the chef, as shown in the figure above.
Fig: Hadoop Tutorial – Traditional Scenario

After a few months, Bob thought of expanding his business, so he started taking online orders
and added a few more cuisines to the restaurant’s menu in order to engage a larger audience.
Because of this transition, the rate at which orders arrived rose to an alarming 10 orders per
hour, and it became quite difficult for a single cook to cope with the situation. Aware of the
problem, Bob started thinking about a solution.

Fig: Hadoop Tutorial – Distributed Processing Scenario

Similarly, in the Big Data scenario, data started getting generated at an alarming rate because of
the introduction of various data growth drivers such as social media, smartphones, etc. Now, the
traditional system, just like the cook in Bob’s restaurant, was not efficient enough to handle this
sudden change. Thus, there was a need for a different kind of solution strategy to cope with
this problem.

After a lot of research, Bob came up with a solution: he hired 4 more chefs to tackle the
huge rate of incoming orders. Everything was going quite well, but this solution led to one
more problem. Since all the chefs were sharing the same food shelf, that food shelf became
the bottleneck of the whole process. Hence, the solution was not as efficient as Bob had
thought.
Fig: Hadoop Tutorial – Distributed Processing Scenario Failure

Similarly, to tackle the problem of processing huge datasets, multiple processing units were
installed to process the data in parallel (just like Bob hired 4 more chefs). But even in this case,
adding multiple processing units was not an effective solution, because the centralized storage
unit became the bottleneck. In other words, the performance of the whole system is driven by the
performance of the central storage unit. The moment the central storage goes down, the whole
system is compromised. Hence, there was again a need to resolve this single point of failure.

Fig: Hadoop Tutorial – Solution to Restaurant Problem

Bob came up with another, more efficient solution: he divided the chefs into two hierarchies,
junior chefs and a head chef, and assigned each junior chef a food shelf. Let us assume that the
dish is meat sauce. Now, according to Bob’s plan, one junior chef will prepare the meat and the
other junior chef will prepare the sauce. They will then pass both the meat and the sauce to the
head chef, who will prepare the meat sauce by combining the two ingredients and deliver the
final order.
Fig: Hadoop Tutorial – Hadoop in Restaurant Analogy

Hadoop functions in a similar fashion to Bob’s restaurant. Just as the food shelves are
distributed in Bob’s restaurant, in Hadoop the data is stored in a distributed fashion, with
replication, to provide fault tolerance. For parallel processing, the data is first processed by the
slave nodes where it is stored, producing intermediate results, and those intermediate results are
then merged by the master node to produce the final result.

Now, you must have got an idea of why Big Data is a problem statement and how Hadoop solves
it. As we just discussed, there were three major challenges with Big Data:

 The first problem is storing the colossal amount of data. Storing huge data in a
traditional system is not possible. The reason is obvious: the storage is limited to one
system, while the data is increasing at a tremendous rate.
 The second problem is storing heterogeneous data. Now we know that storing is a
problem, but it is just one part of the problem. The data is not only huge, it is also
present in various formats, i.e. unstructured, semi-structured and structured. So, you
need to make sure that you have a system to store the different types of data generated
from various sources.
 Finally, let’s focus on the third problem, which is processing speed. The time taken to
process this huge amount of data is quite high, because the data to be processed is too
large.

Hadoop Use Case 2: Last.fm Case Study


Last.fm is an internet radio and community-driven music discovery service founded in 2002. Users
transmit information to Last.fm servers indicating which songs they are listening to. The
received data is processed and stored so that the user can access it in the form of charts. Thus,
Last.fm can make intelligent decisions about taste and compatibility when generating
recommendations. The data is obtained from one of the two sources stated below:

 scrobble: When a user plays a track of his or her own choice and sends the information
to Last.fm through a client application.
 radio listen: When the user tunes into a Last.fm radio station and streams a song.

Last.fm applications allow users to love, skip or ban each track they listen to. This track listening
data is also transmitted to the server.

 Over 40M unique visitors and 500M page views each month
 Scrobble stats:
o Up to 800 scrobbles per second
o More than 40 million scrobbles per day
o Over 75 billion scrobbles so far
 Radio stats:
o Over 10 million streaming hours per month
o Over 400 thousand unique stations per day
 Each scrobble and radio listen generates at least one log line

Hadoop at Last.FM

 100 Nodes
 8 cores per node (dual quad-core)
 24GB memory per node
 8TB (4 disks of 2TB each)
 Hive integration to run optimized SQL queries for analysis

Last.FM started using Hadoop in 2006 because of the growth in users from thousands to
millions. With the help of Hadoop they processed hundreds of daily, weekly and monthly jobs,
including website stats and metrics, chart generation (i.e. track statistics), metadata corrections
(e.g. misspellings of artists), indexing for search, combining/formatting data for
recommendations, data insights, evaluations and reporting. This helped Last.FM grow
tremendously and figure out the taste of their users, based on which they started recommending
music.

Hadoop Tutorial: What is Hadoop?


Hadoop is an open-source software framework used for storing and processing Big Data in a
distributed manner on large clusters of commodity hardware. Hadoop is licensed under the
Apache v2 license. Hadoop was developed based on the papers written by Google on the Google
File System and MapReduce, and it applies concepts of functional programming. Hadoop is
written in the Java programming language and ranks among the top-level Apache projects.
Hadoop was developed by Doug Cutting and Michael J. Cafarella.

Hadoop Tutorial: Hadoop-as-a-Solution


Let’s understand how Hadoop provides a solution to the Big Data problems that we have discussed
so far.

Fig: Hadoop Tutorial – Hadoop-as-a-Solution

The first problem is storing a huge amount of data.


As you can see in the above image, HDFS provides a distributed way to store Big Data. Your
data is stored in blocks across DataNodes, and you specify the size of each block. Suppose you
have 512 MB of data and you have configured HDFS to create 128 MB data blocks. HDFS will
then divide the data into 4 blocks (512/128 = 4) and store them across different DataNodes.
While storing these data blocks in DataNodes, the blocks are also replicated on different
DataNodes to provide fault tolerance.
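
To make the block arithmetic concrete, here is a minimal Python sketch (not part of Hadoop; the 128 MB block size and replication factor of 3 are simply the common HDFS defaults assumed here) that estimates how many blocks and block replicas a file of a given size would produce:

import math

def hdfs_block_count(file_size_mb, block_size_mb=128, replication=3):
    """Estimate the number of HDFS blocks and block replicas for a file.

    Assumes the default 128 MB block size and a replication factor of 3;
    the last block may be smaller than block_size_mb.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    return blocks, blocks * replication

# 512 MB of data with 128 MB blocks -> 4 blocks, 12 block replicas in total
print(hdfs_block_count(512))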

Hadoop follows horizontal scaling instead of vertical scaling. In horizontal scaling, you can add
new nodes to the HDFS cluster on the fly as per requirement, instead of increasing the hardware
stack of each node.

The next problem was storing the variety of data.

As you can see in the above image, in HDFS you can store all kinds of data, whether structured,
semi-structured or unstructured. In HDFS, there is no schema validation before data is loaded.
HDFS also follows a write-once, read-many model. Because of this, you can write any kind of
data once and read it multiple times to find insights.

The third challenge was about processing the data faster.

In order to solve this, we move the processing unit to the data instead of moving the data to the
processing unit. So, what does moving the computation unit to the data mean? It means that
instead of moving data from different nodes to a single master node for processing, the
processing logic is sent to the nodes where the data is stored, so that each node can process its
part of the data in parallel. Finally, all of the intermediate output produced by each node is
merged together and the final response is sent back to the client.
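
As a rough illustration of this idea (a toy, single-machine sketch, not actual Hadoop code), the snippet below pretends that three nodes each hold a slice of the data, lets each node count only its local slice, and then merges the small intermediate results at a master, much like the junior chefs and head chef above:

from collections import Counter

# each "node" holds its own slice of the data (data locality)
node_data = [
    "foo foo quux labs",
    "foo bar quux",
    "labs bar bar",
]

# map phase: every node counts only the words it stores locally
partial_counts = [Counter(chunk.split()) for chunk in node_data]

# reduce phase: the master merges the small intermediate results
# instead of ever shipping the raw data to a single machine
total = Counter()
for partial in partial_counts:
    total.update(partial)

print(total)  # e.g. Counter({'foo': 3, 'bar': 3, 'quux': 2, 'labs': 2})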

Hadoop Features

Reliability:

When machines are working in tandem, if one of the machines fails, another machine will take
over its responsibility and work in a reliable and fault-tolerant fashion. The Hadoop infrastructure
has inbuilt fault tolerance features and hence Hadoop is highly reliable.

Economical:

Hadoop uses commodity hardware (like your PC or laptop). For example, in a small Hadoop
cluster, all your DataNodes can have normal configurations like 8-16 GB of RAM with 5-10 TB
hard disks and Xeon processors, but if I had used hardware-based RAID with Oracle for the
same purpose, I would have ended up spending at least five times more. So, the cost of ownership
of a Hadoop-based project is quite low. It is easier to maintain a Hadoop environment and it is
economical as well. Also, Hadoop is open-source software and hence there is no licensing
cost.

Scalability:

Hadoop has the inbuilt capability of integrating seamlessly with cloud-based services. So, if you
are installing Hadoop on a cloud, you don’t need to worry about the scalability factor because
you can go ahead and procure more hardware and expand your setup within minutes whenever
required.

Flexibility:

Hadoop is very flexible in terms of ability to deal with all kinds of data. We discussed
“Variety” in our previous blog on Big Data Tutorial, where data can be of any kind and Hadoop
can store and process them all, whether it is structured, semi-structured or unstructured data.

These 4 characteristics make Hadoop a front-runner as a solution to Big Data challenges. Now
that we know what Hadoop is, let us explore the core components of Hadoop.

Hadoop Core Components


While setting up a Hadoop cluster, you have the option of choosing a lot of services as part of
your Hadoop platform, but there are two services which are always mandatory for setting
up Hadoop. One is HDFS (storage) and the other is YARN (processing). HDFS stands
for Hadoop Distributed File System, which is the scalable storage unit of Hadoop, whereas
YARN is used to process the data stored in HDFS in a distributed and parallel fashion.

HDFS:

Let us go ahead with HDFS first. The main components of HDFS are the NameNode and the
DataNode. Let us talk about the roles of these two components in detail.

Fig: Hadoop Tutorial – HDFS

NameNode

 It is the master daemon that maintains and manages the DataNodes (slave nodes)
 It records the metadata of all the blocks stored in the cluster, e.g. location of blocks
stored, size of the files, permissions, hierarchy, etc.
 It records each and every change that takes place to the file system metadata
 If a file is deleted in HDFS, the NameNode will immediately record this in the EditLog
 It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster
to ensure that the DataNodes are live
 It keeps a record of all the blocks in HDFS and the DataNodes in which they are stored
 It has high availability and federation features, which are discussed in detail in the
HDFS architecture blog

DataNode

 It is the slave daemon which runs on each slave machine
 The actual data is stored on DataNodes
 It is responsible for serving read and write requests from the clients
 It is also responsible for creating blocks, deleting blocks and replicating the same based
on the decisions taken by the NameNode
 It sends heartbeats to the NameNode periodically to report the overall health of HDFS; by
default, this frequency is set to 3 seconds

So, this was all about HDFS in a nutshell. Now, let us move ahead to our second fundamental unit
of Hadoop, i.e. YARN.

YARN

YARN comprises two major components: the ResourceManager and the NodeManager.

Fig: Hadoop Tutorial – YARN


ResourceManager

 It is a cluster-level component (one for each cluster) and runs on the master machine
 It manages resources and schedules applications running on top of YARN
 It has two components: the Scheduler and the ApplicationsManager
 The Scheduler is responsible for allocating resources to the various running applications
 The ApplicationsManager is responsible for accepting job submissions and negotiating the
first container for executing the application
 It keeps track of the heartbeats from the NodeManagers

NodeManager

 It is a node-level component (one on each node) and runs on each slave machine
 It is responsible for managing containers and monitoring resource utilization in each
container
 It also keeps track of node health and log management
 It continuously communicates with the ResourceManager to remain up to date

Hadoop Tutorial: Hadoop Ecosystem


So far you will have figured out that Hadoop is neither a programming language nor a service;
it is a platform or framework which solves Big Data problems. You can consider it a suite
which encompasses a number of services for ingesting, storing and analyzing huge data sets,
along with tools for configuration management.

Fig: Hadoop Tutorial – Hadoop Ecosystem

We have discussed the Hadoop Ecosystem and its components in detail in our Hadoop Ecosystem
blog. The Last.fm case study above shows how Hadoop was used as part of a real-world solution
strategy.
Questions for you:

1. Give examples of Hadoop installation software

2. Explain a use case of Hadoop

Eucalyptus Basics:

Please refer to your notes

4. Hadoop Installation:

Please refer to content sent in Google Classroom or the below link

https://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

5. Python Wordcount Program on Local Computer:

Refer to your notes
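
In case the notes are not at hand, here is a minimal local word-count sketch; the input path sample.txt is only a placeholder. It produces the same tab-separated word/count output as the MapReduce version in the next section.

#!/usr/bin/env python
"""Minimal local word count; sample.txt is a placeholder input file."""

from collections import Counter

counts = Counter()
with open('sample.txt') as fh:
    for line in fh:
        # split on whitespace, exactly like the Streaming mapper below
        counts.update(line.strip().split())

# print word<TAB>count pairs, sorted alphabetically
for word, count in sorted(counts.items()):
    print('%s\t%s' % (word, count))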

6. Python Word Count using MapReduce:

Have a look at the example in $HADOOP_HOME/src/examples/python/WordCount.py and you
will see what I mean.

Python MapReduce Code

The “trick” behind the following Python code is that we will use the Hadoop Streaming API to
help us pass data between our Map and Reduce code via STDIN (standard input)
and STDOUT (standard output). We will simply use Python’s sys.stdin to read input data and
print our own output to sys.stdout. That’s all we need to do, because Hadoop Streaming will take
care of everything else!

Map step: mapper.py

Save the following code in the file /home/hduser/mapper.py. It will read data from STDIN, split
it into words and output a list of lines mapping words to their (intermediate) counts
to STDOUT. The Map script will not compute an (intermediate) sum of a word’s occurrences,
though. Instead, it will output <word> 1 tuples immediately, even though a specific word
might occur multiple times in the input. In our case we let the subsequent Reduce step do the
final sum count. Of course, you can change this behavior in your own scripts as you please, but
we will keep it like that in this tutorial for didactic reasons. :-)

Make sure the file has execution permission ( chmod +x /home/hduser/mapper.py should do the
trick) or you will run into problems.

#!/usr/bin/env python
"""mapper.py"""

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))

Reduce step: reducer.py

Save the following code in the file /home/hduser/reducer.py. It will read the results
of mapper.py from STDIN (so the output format of mapper.py and the expected input format
of reducer.py must match), sum the occurrences of each word to a final count, and then
output the results to STDOUT.

Make sure the file has execution permission ( chmod +x /home/hduser/reducer.py should do the
trick) or you will run into problems.

#!/usr/bin/env python
"""reducer.py"""

from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print('%s\t%s' % (current_word, current_count))

Test your code (cat data | map | sort | reduce)

I recommend testing your mapper.py and reducer.py scripts locally before using them in a
MapReduce job. Otherwise your jobs might complete successfully but produce no job result
data at all, or not the results you would have expected. If that happens, most likely it was you (or
me) who screwed up.

Here are some ideas on how to test the functionality of the Map and Reduce scripts.

# Test mapper.py and reducer.py locally first


# very basic test

hduser@ubuntu:~$ echo "foo foo quux labs foo bar quux" | /home/hduser/mapper.py

foo 1

foo 1

quux 1

labs 1

foo 1

bar 1

quux 1

hduser@ubuntu:~$ echo "foo foo quux labs foo bar quux" | /home/hduser/mapper.py | sort -k1,1 |
/home/hduser/reducer.py

bar 1

foo 3

labs 1

quux 2

# using one of the ebooks as example input

# (see below on where to get the ebooks)

hduser@ubuntu:~$ cat /tmp/gutenberg/20417-8.txt | /home/hduser/mapper.py

The 1

Project 1

Gutenberg 1

EBook 1

of 1

[...]
(you get the idea)

Running the Python Code on Hadoop

Download example input data

We will use three ebooks from Project Gutenberg for this example:

 The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson


 The Notebooks of Leonardo Da Vinci
 Ulysses by James Joyce

Download each ebook as a text file in Plain Text UTF-8 encoding and store the files in a local
temporary directory of your choice, for example /tmp/gutenberg.

hduser@ubuntu:~$ ls -l /tmp/gutenberg/

total 3604

-rw-r--r-- 1 hduser hadoop 674566 Feb 3 10:17 pg20417.txt

-rw-r--r-- 1 hduser hadoop 1573112 Feb 3 10:18 pg4300.txt

-rw-r--r-- 1 hduser hadoop 1423801 Feb 3 10:18 pg5000.txt

hduser@ubuntu:~$

Copy local example data to HDFS

Before we run the actual MapReduce job, we must first copy the files from our local file system
to Hadoop’s HDFS.

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls

Found 1 items

drwxr-xr-x - hduser supergroup 0 2010-05-08 17:40 /user/hduser/gutenberg

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser/gutenberg

Found 3 items

-rw-r--r-- 3 hduser supergroup 674566 2011-03-10 11:38 /user/hduser/gutenberg/pg20417.txt


-rw-r--r-- 3 hduser supergroup 1573112 2011-03-10 11:38 /user/hduser/gutenberg/pg4300.txt

-rw-r--r-- 3 hduser supergroup 1423801 2011-03-10 11:38 /user/hduser/gutenberg/pg5000.txt

hduser@ubuntu:/usr/local/hadoop$

Run the MapReduce job

Now that everything is prepared, we can finally run our Python MapReduce job on the Hadoop
cluster. As I said above, we leverage the Hadoop Streaming API to help us pass data between our
Map and Reduce code via STDIN and STDOUT.

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \

-file /home/hduser/mapper.py -mapper /home/hduser/mapper.py \

-file /home/hduser/reducer.py -reducer /home/hduser/reducer.py \

-input /user/hduser/gutenberg/* -output /user/hduser/gutenberg-output

If you want to modify some Hadoop settings on the fly like increasing the number of Reduce
tasks, you can use the -D option:

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar -D mapred.reduce.tasks=16 ...

Note about mapred.map.tasks: Hadoop does not honor mapred.map.tasks beyond considering it a
hint, but it does accept the user-specified mapred.reduce.tasks and does not manipulate that. You
cannot force mapred.map.tasks, but you can specify mapred.reduce.tasks.
The job will read all the files in the HDFS directory /user/hduser/gutenberg, process them, and
store the results in the HDFS directory /user/hduser/gutenberg-output. In general Hadoop will
create one output file per reducer; in our case, however, it will only create a single file because
the input files are very small.

Example output of the previous command in the console:

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar -mapper /home/hduser/mapper.py -reducer /home/hduser/reducer.py -input /user/hduser/gutenberg/* -output /user/hduser/gutenberg-output

additionalConfSpec_:null

null=@@@userJobConfProps_.get(stream.shipped.hadoopstreaming
packageJobJar: [/app/hadoop/tmp/hadoop-unjar54543/]

[] /tmp/streamjob54544.jar tmpDir=null

[...] INFO mapred.FileInputFormat: Total input paths to process : 7

[...] INFO streaming.StreamJob: getLocalDirs(): [/app/hadoop/tmp/mapred/local]

[...] INFO streaming.StreamJob: Running job: job_200803031615_0021

[...]

[...] INFO streaming.StreamJob: map 0% reduce 0%

[...] INFO streaming.StreamJob: map 43% reduce 0%

[...] INFO streaming.StreamJob: map 86% reduce 0%

[...] INFO streaming.StreamJob: map 100% reduce 0%

[...] INFO streaming.StreamJob: map 100% reduce 33%

[...] INFO streaming.StreamJob: map 100% reduce 70%

[...] INFO streaming.StreamJob: map 100% reduce 77%

[...] INFO streaming.StreamJob: map 100% reduce 100%

[...] INFO streaming.StreamJob: Job complete: job_200803031615_0021

[...] INFO streaming.StreamJob: Output: /user/hduser/gutenberg-output

hduser@ubuntu:/usr/local/hadoop$

As you can see in the output above, Hadoop also provides a basic web interface for statistics and
information. When the Hadoop cluster is running, open http://localhost:50030/ in a browser and
have a look around. Here’s a screenshot of the Hadoop web interface for the job we just ran.
Figure 1: A screenshot of Hadoop's JobTracker web interface, showing the details of the
MapReduce job we just ran

Check if the result is successfully stored in HDFS directory /user/hduser/gutenberg-output :

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser/gutenberg-output

Found 1 items

/user/hduser/gutenberg-output/part-00000 <r 1> 903193 2007-09-21 13:00

hduser@ubuntu:/usr/local/hadoop$
You can then inspect the contents of the file with the dfs -cat command:

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat /user/hduser/gutenberg-output/part-00000

"(Lo)cra" 1

"1490 1

"1498," 1

"35" 1

"40," 1

"A 2

"AS-IS". 2

"A_ 1

"Absoluti 1

[...]

hduser@ubuntu:/usr/local/hadoop$

Note that in this specific output above the quote signs ( " ) enclosing the words have not been
inserted by Hadoop. They are the result of how our Python code splits words, and in this case it
matched the beginning of a quote in the ebook texts. Just inspect the part-00000 file further to
see it for yourself.
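
You can reproduce this behaviour locally: Python's split() only breaks on whitespace, so leading or trailing punctuation stays attached to the token, which is why entries such as "(Lo)cra" and "1490 appear in the output. A quick check:

# str.split() keeps punctuation attached to the surrounding word
line = 'The Project Gutenberg EBook of "1490, "(Lo)cra"'
print(line.strip().split())
# ['The', 'Project', 'Gutenberg', 'EBook', 'of', '"1490,', '"(Lo)cra"']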
