
APACHE FLUME ADVANCED

Vishwaraj Bangari
Table of Contents

Spooling Directory

Multiplexing the Flow

Flume Data Accessed by Hive Tables

Spooling Directory

Here the source lets you ingest data by placing files into a "spooling" directory on disk. The
source watches the specified directory for new files and parses events out of them as they appear;
the event-parsing logic is pluggable. After a given file has been fully read into the channel, it
is renamed to indicate completion (or optionally deleted).

Step 1: Create a spooldir and place input files in it
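For example, the spooling directory and a couple of input files can be created as follows (the directory path and file names are assumptions; adjust them to your environment):

mkdir -p /home/cts419866/flume/spooldir
echo "sample record 1" > /home/cts419866/flume/spooldir/input1.txt
echo "sample record 2" > /home/cts419866/flume/spooldir/input2.txt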

Step 2: Create a spooldir configuration file in your local directory.
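Since the screenshot is not reproduced here, the following is a minimal sketch of such a configuration. The agent name "agent", the file name spooldirflume.conf, and the local and HDFS paths are assumptions; adjust them to your environment.

Code:

# Name the agent components
agent.sources = spoolsrc
agent.channels = mc1
agent.sinks = hdfs1

# Spooling directory source watching the local directory
agent.sources.spoolsrc.type = spooldir
agent.sources.spoolsrc.spoolDir = /home/cts419866/flume/spooldir
agent.sources.spoolsrc.fileSuffix = .COMPLETED
agent.sources.spoolsrc.channels = mc1

# Memory channel
agent.channels.mc1.type = memory
agent.channels.mc1.capacity = 100
agent.channels.mc1.transactionCapacity = 100

# HDFS sink writing plain text files
agent.sinks.hdfs1.channel = mc1
agent.sinks.hdfs1.type = hdfs
agent.sinks.hdfs1.hdfs.path = hdfs://aster2:8020/user/cts419866/flume/spool
agent.sinks.hdfs1.hdfs.fileType = DataStream
# roll over to a new file after every 50 events
agent.sinks.hdfs1.hdfs.rollCount = 50
agent.sinks.hdfs1.hdfs.rollInterval = 0
agent.sinks.hdfs1.hdfs.rollSize = 0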

Step 3: Run the agent as below
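A typical command looks like the following; the configuration-file name spooldirflume.conf matches the sketch above and is an assumption:

flume-ng agent --conf /home/cts419866/flume/conf \
  --conf-file /home/cts419866/flume/conf/spooldirflume.conf \
  --name agent -Dflume.root.logger=INFO,console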

Step 4: You will see the following screen.

Step 5: In another PuTTY session, check the output path.

The input files are combined into a single file in the output path.
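The output can be verified with the HDFS shell, for example (the path matches the sketch configuration above; FlumeData is the default file prefix of the HDFS sink):

hadoop fs -ls /user/cts419866/flume/spool
hadoop fs -cat /user/cts419866/flume/spool/FlumeData.*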

Step 6: Check the local input path.

Files that have been transferred successfully are renamed with the .COMPLETED suffix.
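For example (the file names are assumptions):

ls /home/cts419866/flume/spooldir
# input1.txt.COMPLETED  input2.txt.COMPLETED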

Step 7: While the agent is running, create a new file in the local path as follows.

After the new file is created, the agent picks it up and processes it, as shown in the screenshot
below.
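For example, a new file with a unique name can be dropped into the spooling directory like this (the file name is an assumption):

echo "new sample record" > /home/cts419866/flume/spooldir/input3.txt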

Check the output path for the successful transfer of the file.

Unlike the Exec source, this source is reliable and will not miss data, even if Flume is restarted or
killed. In exchange for this reliability, only immutable, uniquely-named files must be dropped into
the spooling directory. Flume tries to detect these problem conditions and will fail loudly if they
are violated:

1. If a file is written to after being placed into the spooling directory, Flume will print an error
to its log file and stop processing.
2. If a file name is reused at a later time, Flume will print an error to its log file and stop
processing.

To avoid the above issues, it may be useful to add a unique identifier (such as a timestamp) to log
file names when they are moved into the spooling directory.
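One common way to do this is to append a timestamp when the file is moved into the spooling directory, for example (the paths are assumptions):

mv /var/log/app.log /home/cts419866/flume/spooldir/app.$(date +%s).log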

Step 8: If you create a new input file with the same name as an already processed file, the agent
will throw an error.

The file with the duplicate name is not transferred to the output directory, and the result looks as follows.

Multiplexing the flow
Multiplexing the flow means delivering the same input data to two or more
destinations, using as many channels as there are destinations.

Example:

Step 1: Create an input (source) file in your local directory.
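For example, the file tailed by the configuration below can be created with some sample records (the content is arbitrary test data):

echo "multiplex record 1" >  /home/cts419866/flume/input/multiinput.txt
echo "multiplex record 2" >> /home/cts419866/flume/input/multiinput.txt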

Step 2: Create a configuration file in your local path at

/home/cts419866/flume/conf/Multiplexingflume2.conf

Code:

agent.sources = vishi
agent.channels = mc1 mc2 mc3
agent.sinks = hdfs1 hdfs2 hdfs3

# Setting the source to exec, tailing the input file

agent.sources.vishi.type = exec
agent.sources.vishi.command = tail -F /home/cts419866/flume/input/multiinput.txt
agent.sources.vishi.channels = mc1 mc2 mc3

# Setting the channel to memory

agent.channels.mc1.type = memory
# Max number of events stored in the mc1
agent.channels.mc1.capacity = 100
agent.channels.mc1.transactionCapacity = 100

agent.channels.mc2.type = memory
# Max number of events stored in the mc2
agent.channels.mc2.capacity = 100
agent.channels.mc2.transactionCapacity = 100

agent.channels.mc3.type = memory
# Max number of events stored in the mc3
agent.channels.mc3.capacity = 100
agent.channels.mc3.transactionCapacity = 100

# Connect source and sink with channel


#First Sink

agent.sinks.hdfs1.channel = mc1
agent.sinks.hdfs1.type = hdfs

agent.sinks.hdfs1.hdfs.path = hdfs://aster2:8020/user/cts419866/flume
agent.sinks.hdfs1.hdfs.filePrefix = events
#agent.sinks.HDFS.hdfs.writeFormat = Text
agent.sinks.hdfs1.hdfs.fileType = DataStream
# roll over to a new file after every 50 events (size- and time-based rolling disabled)
agent.sinks.hdfs1.hdfs.rollCount = 50
agent.sinks.hdfs1.hdfs.rollInterval = 0
agent.sinks.hdfs1.hdfs.rollSize = 0
agent.sinks.hdfs1.hdfs.batchSize = 10

#Second Sink

agent.sinks.hdfs2.channel = mc2
agent.sinks.hdfs2.type = hdfs

agent.sinks.hdfs2.hdfs.path = hdfs://aster2:8020/user/cts419866/flume/tail
agent.sinks.hdfs2.hdfs.filePrefix = events
#agent.sinks.HDFS.hdfs.writeFormat = Text

agent.sinks.hdfs2.hdfs.fileType = DataStream
# roll over to a new file after every 50 events (size- and time-based rolling disabled)
agent.sinks.hdfs2.hdfs.rollCount = 50
agent.sinks.hdfs2.hdfs.rollInterval = 0
agent.sinks.hdfs2.hdfs.rollSize = 0
agent.sinks.hdfs2.hdfs.batchSize = 10

#Third Sink
agent.sinks.hdfs3.channel = mc3
agent.sinks.hdfs3.type = hdfs

agent.sinks.hdfs3.hdfs.path = hdfs://aster2:8020/user/cts419866/flume/tail/events
agent.sinks.hdfs3.hdfs.filePrefix = events
#agent.sinks.HDFS.hdfs.writeFormat = Text
agent.sinks.hdfs3.hdfs.fileType = DataStream
# roll over to a new file after every 50 events (size- and time-based rolling disabled)
agent.sinks.hdfs3.hdfs.rollCount = 50
agent.sinks.hdfs3.hdfs.rollInterval = 0
agent.sinks.hdfs3.hdfs.rollSize = 0
agent.sinks.hdfs3.hdfs.batchSize = 10

Step 3: Run the agent
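The agent can be started with a command like the following; the --name value must match the property prefix "agent" used in the configuration above:

flume-ng agent --conf /home/cts419866/flume/conf \
  --conf-file /home/cts419866/flume/conf/Multiplexingflume2.conf \
  --name agent -Dflume.root.logger=INFO,console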

Step 4: The screen will appear like this.

Step 5: Check all the output paths.
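For example, using the HDFS shell (the three paths come from the sink configuration above, and events is the configured hdfs.filePrefix):

hadoop fs -ls /user/cts419866/flume
hadoop fs -ls /user/cts419866/flume/tail
hadoop fs -ls /user/cts419866/flume/tail/events
hadoop fs -cat /user/cts419866/flume/tail/events/events.*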

First Sink:

Second Sink:

Third Sink:

In the end, three output files are created at three different HDFS locations, each containing
the same content as the input file.

Flume Data Accessed by Hive Tables
Step 1: Create a sample input file in your local directory.

Step 2: Create a configuration file in your local directory that uses timestamp escape sequences in the HDFS path.
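Since the screenshot is not reproduced here, the following is a minimal sketch of such a configuration; the agent name, source file, and HDFS paths are assumptions:

Code:

agent.sources = tailsrc
agent.channels = mc1
agent.sinks = hdfs1

# Exec source tailing the sample input file
agent.sources.tailsrc.type = exec
agent.sources.tailsrc.command = tail -F /home/cts419866/flume/input/hiveinput.txt
agent.sources.tailsrc.channels = mc1

agent.channels.mc1.type = memory
agent.channels.mc1.capacity = 100
agent.channels.mc1.transactionCapacity = 100

# HDFS sink; the escape sequences in the path create one directory per day
agent.sinks.hdfs1.channel = mc1
agent.sinks.hdfs1.type = hdfs
agent.sinks.hdfs1.hdfs.path = hdfs://aster2:8020/user/cts419866/flume/hive/%y-%m-%d
# the exec source does not add a timestamp header, so use the local time
agent.sinks.hdfs1.hdfs.useLocalTimeStamp = true
agent.sinks.hdfs1.hdfs.fileType = DataStream
agent.sinks.hdfs1.hdfs.round = true
agent.sinks.hdfs1.hdfs.roundValue = 10
agent.sinks.hdfs1.hdfs.roundUnit = minute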

The timestamp parameters are:

hdfs.round: Should the timestamp be rounded down (if true, this affects all time-based
escape sequences except %t).

hdfs.roundValue: Rounded down to the highest multiple of this (in the unit
configured using hdfs.roundUnit), less than the current time.

hdfs.roundUnit: The unit of the round-down value: second, minute or hour.
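As an illustration of how these parameters interact (the path below is a hypothetical example, not taken from the screenshots): with hdfs.path = /flume/events/%Y-%m-%d/%H%M, hdfs.round = true, hdfs.roundValue = 10 and hdfs.roundUnit = minute, an event with timestamp 11:54:34 on 2012-06-12 is written under /flume/events/2012-06-12/1150, because the timestamp is rounded down to the nearest multiple of 10 minutes.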

Step 3: Run the agent

On success, the screen will appear like this.

Step 4: Check the output directory and output file.

Check the output file.

Open Hive.

Step 5: Create an external table partitioned by elocation.
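A minimal sketch of such a table is shown below; the table name flume_events, the single string column, and the base location are assumptions:

CREATE EXTERNAL TABLE flume_events (event_line STRING)
PARTITIONED BY (elocation STRING)
LOCATION '/user/cts419866/flume/hive';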

Step 6: Alter the table to point a partition at the data already written by Flume, as follows.
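For example (the partition value and the dated subdirectory are assumptions based on the sketch configuration above):

ALTER TABLE flume_events
  ADD PARTITION (elocation = 'flume_load')
  LOCATION '/user/cts419866/flume/hive/17-01-05';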

Step 7: Display the table
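For example, querying the assumed table and partition from the previous steps:

SELECT * FROM flume_events WHERE elocation = 'flume_load';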

In this way, the Flume data is accessed through Hive using an external partitioned table.

