
APACHE FLUME ADVANCED

Vishwaraj Bangari
Table of Contents

Spooling Directory

Multiplexing the Flow

Flume Data Accessed by Hive Tables

Spooling Directory

Here the source lets you ingest data by placing files into a "spooling" directory on disk. The
source watches the specified directory for new files and parses events out of them as they appear;
the event-parsing logic is pluggable. After a given file has been fully read into the channel, it
is renamed to indicate completion (or optionally deleted).

Step 1: Create a spooldir and place input files in it
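For example, the spooling directory and a couple of input files can be created as follows (the directory path and file names are assumptions; adjust them to your environment):

mkdir -p /home/cts419866/flume/spooldir
echo "sample record 1" > /home/cts419866/flume/spooldir/input1.txt
echo "sample record 2" > /home/cts419866/flume/spooldir/input2.txt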

Step 2: Create a spooldir configuration file in your local directory.
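Since the screenshot is not reproduced here, the following is a minimal sketch of such a configuration. The agent name "agent", the file name spooldirflume.conf, and the local and HDFS paths are assumptions; adjust them to your environment.

Code:

# Name the agent components
agent.sources = spoolsrc
agent.channels = mc1
agent.sinks = hdfs1

# Spooling directory source watching the local directory
agent.sources.spoolsrc.type = spooldir
agent.sources.spoolsrc.spoolDir = /home/cts419866/flume/spooldir
agent.sources.spoolsrc.fileSuffix = .COMPLETED
agent.sources.spoolsrc.channels = mc1

# Memory channel
agent.channels.mc1.type = memory
agent.channels.mc1.capacity = 100
agent.channels.mc1.transactionCapacity = 100

# HDFS sink writing plain text files
agent.sinks.hdfs1.channel = mc1
agent.sinks.hdfs1.type = hdfs
agent.sinks.hdfs1.hdfs.path = hdfs://aster2:8020/user/cts419866/flume/spool
agent.sinks.hdfs1.hdfs.fileType = DataStream
# roll over to a new file after every 50 events
agent.sinks.hdfs1.hdfs.rollCount = 50
agent.sinks.hdfs1.hdfs.rollInterval = 0
agent.sinks.hdfs1.hdfs.rollSize = 0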

Step 3: Run the agent as below
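A typical command looks like the following; the configuration-file name spooldirflume.conf matches the sketch above and is an assumption:

flume-ng agent --conf /home/cts419866/flume/conf \
  --conf-file /home/cts419866/flume/conf/spooldirflume.conf \
  --name agent -Dflume.root.logger=INFO,console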

Step 4: You will see the following screen.

Step 5: In another PuTTY session, check the output path.

The input files are combined into a single file in the output path.
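The output can be verified with the HDFS shell, for example (the path matches the sketch configuration above; FlumeData is the default file prefix of the HDFS sink):

hadoop fs -ls /user/cts419866/flume/spool
hadoop fs -cat /user/cts419866/flume/spool/FlumeData.*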

Step 6: Check the local input path.

Files that have been transferred successfully are renamed with the .COMPLETED suffix.
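For example (the file names are assumptions):

ls /home/cts419866/flume/spooldir
# input1.txt.COMPLETED  input2.txt.COMPLETED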

Step 7: While the agent is running, create a new file in the local path as follows.

After the new file is created, the agent picks it up and processes it, as shown in the screenshot
below.
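For example, a new file with a unique name can be dropped into the spooling directory like this (the file name is an assumption):

echo "new sample record" > /home/cts419866/flume/spooldir/input3.txt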

Check the output path for the successful transfer of the file.

Unlike the Exec source, this source is reliable and will not miss data, even if Flume is restarted or
killed. In exchange for this reliability, only immutable, uniquely-named files must be dropped into
the spooling directory. Flume tries to detect these problem conditions and will fail loudly if they
are violated:

1. If a file is written to after being placed into the spooling directory, Flume will print an error
to its log file and stop processing.
2. If a file name is reused at a later time, Flume will print an error to its log file and stop
processing.

To avoid the above issues, it may be useful to add a unique identifier (such as a timestamp) to log
file names when they are moved into the spooling directory.
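One common way to do this is to append a timestamp when the file is moved into the spooling directory, for example (the paths are assumptions):

mv /var/log/app.log /home/cts419866/flume/spooldir/app.$(date +%s).log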

Step 8: If you create a new input file with the same name as an already processed file, the agent
will throw an error.

The file with the duplicate name is not transferred to the output directory, and the result looks as follows.

Multiplexing the flow
Multiplexing the flow means delivering the same input data to two or more
destinations, using as many channels as there are destinations.

Example:

Step 1: Create an input (source) file in your local directory.
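For example, the file tailed by the configuration below can be created with some sample records (the content is arbitrary test data):

echo "multiplex record 1" >  /home/cts419866/flume/input/multiinput.txt
echo "multiplex record 2" >> /home/cts419866/flume/input/multiinput.txt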

Step 2: Create a configuration file in your local path at

/home/cts419866/flume/conf/Multiplexingflume2.conf

Code:

agent.sources = vishi
agent.channels = mc1 mc2 mc3
agent.sinks = hdfs1 hdfs2 hdfs3

# Setting the source to exec, tailing the input file

agent.sources.vishi.type = exec
agent.sources.vishi.command = tail -F /home/cts419866/flume/input/multiinput.txt
agent.sources.vishi.channels = mc1 mc2 mc3

# Setting the channel to memory

agent.channels.mc1.type = memory
# Max number of events stored in the mc1
agent.channels.mc1.capacity = 100
agent.channels.mc1.transactionCapacity = 100

agent.channels.mc2.type = memory
# Max number of events stored in the mc2
agent.channels.mc2.capacity = 100
agent.channels.mc2.transactionCapacity = 100

agent.channels.mc3.type = memory
# Max number of events stored in the mc3
agent.channels.mc3.capacity = 100
agent.channels.mc3.transactionCapacity = 100

# Connect source and sink with channel


#First Sink

agent.sinks.hdfs1.channel = mc1
agent.sinks.hdfs1.type = hdfs

agent.sinks.hdfs1.hdfs.path = hdfs://aster2:8020/user/cts419866/flume
agent.sinks.hdfs1.hdfs.filePrefix = events
#agent.sinks.HDFS.hdfs.writeFormat = Text
agent.sinks.hdfs1.hdfs.fileType = DataStream
# roll over to a new file after every 50 events (size- and time-based rolling disabled)
agent.sinks.hdfs1.hdfs.rollCount = 50
agent.sinks.hdfs1.hdfs.rollInterval = 0
agent.sinks.hdfs1.hdfs.rollSize = 0
agent.sinks.hdfs1.hdfs.batchSize = 10

#Second Sink

agent.sinks.hdfs2.channel = mc2
agent.sinks.hdfs2.type = hdfs

agent.sinks.hdfs2.hdfs.path = hdfs://aster2:8020/user/cts419866/flume/tail
agent.sinks.hdfs2.hdfs.filePrefix = events
#agent.sinks.HDFS.hdfs.writeFormat = Text

agent.sinks.hdfs2.hdfs.fileType = DataStream
# roll over to a new file after every 50 events (size- and time-based rolling disabled)
agent.sinks.hdfs2.hdfs.rollCount = 50
agent.sinks.hdfs2.hdfs.rollInterval = 0
agent.sinks.hdfs2.hdfs.rollSize = 0
agent.sinks.hdfs2.hdfs.batchSize = 10

#Third Sink
agent.sinks.hdfs3.channel = mc3
agent.sinks.hdfs3.type = hdfs

agent.sinks.hdfs3.hdfs.path = hdfs://aster2:8020/user/cts419866/flume/tail/events
agent.sinks.hdfs3.hdfs.filePrefix = events
#agent.sinks.HDFS.hdfs.writeFormat = Text
agent.sinks.hdfs3.hdfs.fileType = DataStream
# roll over to a new file after every 50 events (size- and time-based rolling disabled)
agent.sinks.hdfs3.hdfs.rollCount = 50
agent.sinks.hdfs3.hdfs.rollInterval = 0
agent.sinks.hdfs3.hdfs.rollSize = 0
agent.sinks.hdfs3.hdfs.batchSize = 10

Step 3: Run the agent
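The agent can be started with a command like the following; the --name value must match the property prefix "agent" used in the configuration above:

flume-ng agent --conf /home/cts419866/flume/conf \
  --conf-file /home/cts419866/flume/conf/Multiplexingflume2.conf \
  --name agent -Dflume.root.logger=INFO,console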

Step 4: The screen will appear like this.

Step 5: Check all the output paths.
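For example, using the HDFS shell (the three paths come from the sink configuration above, and events is the configured hdfs.filePrefix):

hadoop fs -ls /user/cts419866/flume
hadoop fs -ls /user/cts419866/flume/tail
hadoop fs -ls /user/cts419866/flume/tail/events
hadoop fs -cat /user/cts419866/flume/tail/events/events.*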

First Sink:

Second Sink:

Third Sink:

In the end, three output files are created at three different HDFS locations, each containing
the same content as the input file.

Flume Data Accessed by Hive Tables
Step 1: Create a sample input file in your local directory.

Step 2: Create a configuration file in your local directory that uses timestamp escape sequences in the HDFS path.
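Since the screenshot is not reproduced here, the following is a minimal sketch of such a configuration; the agent name, source file, and HDFS paths are assumptions:

Code:

agent.sources = tailsrc
agent.channels = mc1
agent.sinks = hdfs1

# Exec source tailing the sample input file
agent.sources.tailsrc.type = exec
agent.sources.tailsrc.command = tail -F /home/cts419866/flume/input/hiveinput.txt
agent.sources.tailsrc.channels = mc1

agent.channels.mc1.type = memory
agent.channels.mc1.capacity = 100
agent.channels.mc1.transactionCapacity = 100

# HDFS sink; the escape sequences in the path create one directory per day
agent.sinks.hdfs1.channel = mc1
agent.sinks.hdfs1.type = hdfs
agent.sinks.hdfs1.hdfs.path = hdfs://aster2:8020/user/cts419866/flume/hive/%y-%m-%d
# the exec source does not add a timestamp header, so use the local time
agent.sinks.hdfs1.hdfs.useLocalTimeStamp = true
agent.sinks.hdfs1.hdfs.fileType = DataStream
agent.sinks.hdfs1.hdfs.round = true
agent.sinks.hdfs1.hdfs.roundValue = 10
agent.sinks.hdfs1.hdfs.roundUnit = minute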

The timestamp parameters are:

hdfs.round: Should the timestamp be rounded down (if true, this affects all time-based
escape sequences except %t).

hdfs.roundValue: Rounded down to the highest multiple of this (in the unit
configured using hdfs.roundUnit), less than the current time.

hdfs.roundUnit: The unit of the round-down value: second, minute or hour.
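As an illustration of how these parameters interact (the path below is a hypothetical example, not taken from the screenshots): with hdfs.path = /flume/events/%Y-%m-%d/%H%M, hdfs.round = true, hdfs.roundValue = 10 and hdfs.roundUnit = minute, an event with timestamp 11:54:34 on 2012-06-12 is written under /flume/events/2012-06-12/1150, because the timestamp is rounded down to the nearest multiple of 10 minutes.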

Step 3: Run the agent

On success, the screen will appear like this.

Step 4: Check the output directory and output file.

Check the output file.

Open Hive.

Step 5: Create an external table partitioned by elocation.
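A minimal sketch of such a table is shown below; the table name flume_events, the single string column, and the base location are assumptions:

CREATE EXTERNAL TABLE flume_events (event_line STRING)
PARTITIONED BY (elocation STRING)
LOCATION '/user/cts419866/flume/hive';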

Step 6: Alter the table to point a partition at the data already written by Flume, as follows.
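For example (the partition value and the dated subdirectory are assumptions based on the sketch configuration above):

ALTER TABLE flume_events
  ADD PARTITION (elocation = 'flume_load')
  LOCATION '/user/cts419866/flume/hive/17-01-05';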

Step 7: Display the table
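For example, querying the assumed table and partition from the previous steps:

SELECT * FROM flume_events WHERE elocation = 'flume_load';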

In this way, the Flume data is accessed through Hive using an external partitioned table.

