While companies across industries are trying to move data from structured relational databases such as MySQL, Teradata, and Netezza to Hadoop, there have been concerns about the ease of transitioning existing data to Hadoop systems.
User considerations:
Data consistency
Data preparation
Transition Efficiency
Sqoop, an Apache Hadoop Ecosystem project, is a command-line interface application for transferring data between
relational databases and Hadoop. It supports incremental loads of a single table, or a free-form SQL query. Exports can be
used to put data from Hadoop into a relational database.
Sqoop is an Apache Hadoop Ecosystem project whose responsibility is to perform import and export operations between relational databases and Hadoop. Reasons for using Sqoop are as follows:
It allows moving selected parts of data from a traditional SQL database to Hadoop
Transferring data using hand-written scripts is inefficient and time-consuming, hence Sqoop is used
Sqoop is required when a database is imported from a Relational Database (RDB) to Hadoop or exported from Hadoop back to an RDB.
While importing a database from an RDB to Hadoop: users must consider the consistency of the data, the consumption of production system resources, and the preparation of data for provisioning the downstream pipeline.
While exporting a database from Hadoop to an RDB: users must keep in mind that directly accessing data residing on external systems from within a MapReduce framework complicates applications and exposes the production system to excessive load originating from cluster nodes.
Sqoop has access to the Hadoop core, which helps it use mappers to slice the incoming data into parts and place the data in HDFS.
It exports data back into the RDB, ensuring that the schema of the data in the database is maintained.
To import data present in a MySQL database using Sqoop, use the import commands shown below.
To export data from Hadoop using Sqoop, use the export command that follows them.
Generic JDBC connector: used to connect to any database that is accessible via JDBC.
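As a hedged illustration of the generic JDBC connector, the sketch below assumes a PostgreSQL database named sales with an orders table and a user sqoop_user (none of which appear in the original examples); only the --driver class and --connect URL change relative to the MySQL commands that follow:
$ sqoop import --driver org.postgresql.Driver --connect jdbc:postgresql://dbhost/sales --username sqoop_user --password sqoop_pwd --table orders --target-dir /user/sqoop/orders -m 1 --as-textfile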
$ sqoop import --driver com.mysql.jdbc.Driver --connect jdbc:mysql://localhost/a10 --username root --password root --table sqoop_demo --target-dir /user/sqoop_batch6 -m 1 --as-textfile
$ sqoop import --driver com.mysql.jdbc.Driver --connect jdbc:mysql://localhost/a10 --username root --password root --table sqoop_demo --target-dir /user/sqoop_batch6 -m 1 --as-textfile --where "id>2"
$ sqoop import --driver com.mysql.jdbc.Driver --connect jdbc:mysql://localhost/a10 --username root --password root --target-dir /user/sqoop_batch6 -e "select * from sqoop_demo where id = 13 and \$CONDITIONS" -m 1 --as-textfile
$ sqoop import --driver com.mysql.jdbc.Driver --connect jdbc:mysql://localhost/a10 --username root --password root --table sqoop_demo --target-dir /user/sqoop_batch6 -m 1 --as-textfile --split-by id
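The commands above each reload the full table or query result. Since incremental loads were mentioned earlier, here is a hedged sketch of one, assuming sqoop_demo has an auto-increment id column and that rows with id greater than 13 are the new ones to fetch; adjust --check-column and --last-value to your own data:
$ sqoop import --driver com.mysql.jdbc.Driver --connect jdbc:mysql://localhost/a10 --username root --password root --table sqoop_demo --target-dir /user/sqoop_batch6 -m 1 --as-textfile --incremental append --check-column id --last-value 13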
$ sqoop export --connect jdbc:mysql://localhost/a2 --username root --password root --table report1 --export-dir /user/hive/warehouse/mobile_vs_sentiments1/ --input-fields-terminated-by '\001'
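Note that Sqoop export writes into an existing table, so report1 must already be present in the a2 database with columns matching the exported fields; the --input-fields-terminated-by '\001' flag matches Hive's default field delimiter. A minimal sketch of the target table, assuming the Hive table holds a mobile name and a sentiment count (the actual columns may differ):
mysql> CREATE TABLE report1 (mobile VARCHAR(100), sentiment_count INT);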
Current scenario: A company has thousands of services running on different servers in a cluster, and these services produce many large logs that need to be analyzed together.
Problem: The issue is how to send the logs to a setup that has Hadoop. The channel or method used for sending must be reliable, scalable, extensible, and manageable.
Solution: A log aggregation tool called Flume can be used to solve this problem. Apache Sqoop and Flume are the tools used to gather data from different sources and load it into HDFS. Sqoop in Hadoop is used to extract structured data from databases such as Teradata, Oracle, and so on, whereas Flume in Hadoop sources data that is stored in different systems and deals with unstructured data.
Apache Flume is a distributed and reliable service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS).
It has a simple and flexible architecture based on streaming data flows, which is robust and fault-tolerant.
It receives data (events) from clients or other agents and forwards it to its next destination (a sink or another agent).
Flume may have more than one agent. The following diagram represents a Flume agent.
To configure an agent, we use a configuration file, which is a Java properties file containing key-value pairs.
In a typical Flume deployment, we may have multiple agents. We therefore use a unique name to differentiate and configure each agent.
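As a minimal sketch of such a properties file, the example below assumes a single agent named agent1 that tails a hypothetical service log with an exec source, buffers events in a memory channel, and writes them to HDFS; the agent name prefixes every key, which is how several agents can be configured in the same deployment:
# flume-agent.conf (agent name, file paths, and component names are illustrative)
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1
# Source: tail a service log file
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/myservice/app.log
agent1.sources.src1.channels = ch1
# Channel: buffer events in memory
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000
# Sink: write events to HDFS
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /user/flume/logs
agent1.sinks.sink1.channel = ch1
The agent can then be started by passing the file and the agent's unique name to flume-ng, for example:
$ flume-ng agent --conf-file flume-agent.conf --name agent1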