Front cover
InfoSphere BigInsights
Foundation
(Course code DW612)
Contents
Trademarks
Instructor exercises overview
Exercises configuration
Exercises description
Exercise 1. Exploring Apache Hadoop
Section 1: Configure your image
Section 2: Start BigInsights components
Section 3: Using the Hadoop command line interface
Section 4: Running a sample MapReduce program from the command line
Section 5: Using the Hadoop web interface to check startup status
Section 6: Work with the BigInsights Web Console
Section 7: Cluster Status
Section 8: Files
Section 9: Application Status
Exercise 2. MapReduce
Part 1: Basic MapReduce development
Section 1: Define a Java project in Eclipse
Section 2: Create a Java package and the mapper class
Section 3: Complete the mapper
Section 4: Create the reducer class
Section 5: Complete the reducer
Section 6: Create the driver
Section 7: Create a JAR file
Section 8: Running your application
Section 9: Add a combiner function
Section 10: Recreate the JAR file
Section 11: Running your application again
Section 1: Create the MapReduce templates
Trademarks
The reader should recognize that the following terms, which appear in the content of this
training document, are official trademarks of IBM or other companies:
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International
Business Machines Corp., registered in many jurisdictions worldwide.
The following are trademarks of International Business Machines Corporation, registered in
many jurisdictions worldwide:
BigInsights
GPFS
InfoSphere
WebSphere
Cognos
Guardium
Notes
DB2
Informix
Power
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or
both.
Windows is a trademark of Microsoft Corporation in the United States, other countries, or
both.
UNIX is a registered trademark of The Open Group in the United States and other
countries.
Java and all Java-based trademarks and logos are trademarks or registered trademarks
of Oracle and/or its affiliates.
VMware and the VMware "boxes" logo and design, Virtual SMP and VMotion are registered
trademarks or trademarks (the "Marks") of VMware, Inc. in the United States and/or other
jurisdictions.
Netezza is a trademark or registered trademark of IBM International Group B.V., an IBM
Company.
Other product and service names might be trademarks of IBM or other companies.
Exercises configuration
Each student has a separate system and students work
independently.
Exercises description
In the exercise instructions, you can check off the line before each
step as you complete it to track your progress.
Most exercises include required sections which should always be
completed. It might be necessary to complete these sections before
you can start later exercises. Some exercises might also include
optional sections that you might want to complete if you have sufficient
time and want an extra challenge.
Exercise 1. Exploring Apache Hadoop
Requirements
Requires the DW612 lab images.
Exercise instructions
Preface
Log in to your image with a userid of biadmin and a password of ibm2blue. If you need to
do something with root, the password is dalvm3.
exit
would have created the GutenbergDocs directory in HDFS and copied the files from the
local file system to HDFS.
__ 13. Monitor the logging information. When the job completes, view the results:
hadoop fs -ls countout
The part-r-00000 file is the one that you want to view.
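To display the job output, you can cat the file (the same hadoop fs -cat pattern is used later in this guide):
hadoop fs -cat countout/part-r-00000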
__ 23. The Log hyperlink allows you to view this TaskTracker's log files.
__ 24. The Job Tracker History hyperlink displays jobs that this JobTracker previously ran.
__ 25. Back on the Job Tracker administration page, in the upper right corner are some
quick links to navigate the JobTracker.
Section 8: Files
__ 9. Click the Files tab.
__ 10. On the left side, expand hdfs://localhost.localdomain:9000/->user->biadmin.
__ 11. Select biadmin. The icons at the top of this frame are now enabled.
__ 12. Create a new directory under biadmin. Select the Create Directory icon.
__ 13. Type in a name of exercise1. Click OK. The newly created directory should now be
selected.
__ 14. Click the Upload icon.
__ 15. Click Browse. Navigate to File System->/home/labfiles/DW61/GutenbergDocs.
__ 16. Select walden.txt and click Open. Note that the file is added to a list. If you want
to upload additional files, browse and select them; they are added to the list. The
files in the list are uploaded when you click OK. Click OK.
__ 17. Expand the exercise1 directory. If the uploaded file does not appear, then right-click
exercise1 and select Refresh.
__ 18. Select walden.txt. The selected file is displayed on the right side.
__ 19. Select the Rename icon.
__ 20. Rename the file to waldenpond.txt and click OK.
__ 21. Select the Delete icon. Click Yes to remove the file.
__ 22. Click the black box icon (to the left of the refresh icon). This will open a Hadoop fs
shell.
__ 23. In the input area, type the following. Then click Submit.
hadoop fs -rmr exercise1
__ 24. Verify that the directory was removed. In the shell dialog type:
hadoop fs -ls
__ 25. Close the shell dialog.
__ 26. The exercise1 directory might still be listed. Select the Refresh icon.
End of exercise
Exercise 2. MapReduce
Estimated time
1:00
Requirements
Requires the DW612 lab images.
Exercise instructions
You will use the BigInsights Eclipse development environment to create a MapReduce
application. You have a file that contains the average monthly temperature from the three
reporting stations in Sumner County, Tennessee for the year 2012. Your goal is to find the
maximum average temperature for each month in 2012. The mapper and reducer will emit
key/value pairs that contain the month and the temperature. (In this data, the temperature
values are three-digit integers. They are the average value multiplied by 10.)
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
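For orientation: the map method you are about to add goes inside a mapper class whose declaration, consistent with the imports above and the driver code later in this exercise, would look roughly like this sketch:
public class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The body is added in the next step.
    }
}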
__ 15. In the map method, add the following code (or whatever code you think is required):
String line = value.toString();
String month = line.substring(22,24);
int avgTemp;
avgTemp = Integer.parseInt(line.substring(95,98));
context.write(new Text(month), new IntWritable(avgTemp));
__ 16. Save your work.
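The reducer must select the maximum temperature for each month key. A minimal sketch of MaxTempReducer, using the class and type names that the driver code below expects (the course's own listing may differ):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Keep the largest average temperature seen for this month.
        int maxValue = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue));
    }
}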
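The driver listing below begins at the job setup calls. The opening of the MaxMonthTemp class would look roughly like this sketch; the GenericOptionsParser pattern and the job name are assumptions inferred from the programArgs variable used below:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class MaxMonthTemp {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Separate generic Hadoop options from the application's own arguments (assumed pattern).
        String[] programArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        Job job = new Job(conf, "Max Monthly Temperature"); // job name is an assumption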
job.setJarByClass(MaxMonthTemp.class);
job.setMapperClass(MaxTempMapper.class);
job.setReducerClass(MaxTempReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(programArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(programArgs[1]));
// Submit the job and wait for it to finish.
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
__ 24. Save your work.
__ 33. Look at the code for MaxMonthlyTemp.java. Add the following statement after the
job.setMapperClass(MaxTempMapper.class); statement.
job.setCombinerClass(MaxTempReducer.class);
The reducer can double as the combiner here because taking a maximum is associative
and commutative, so combining partial results on the map side does not change the final
answer.
__ 34. Save your work.
End of exercise
Requirements
Requires the DW612 lab images.
Exercise instructions
Start all BigInsights components.
__ 1. Open a terminal window. Right-click the desktop's background and choose Open in
Terminal.
__ 2. Change directory to the BigInsights bin directory.
cd $BIGINSIGHTS_HOME/bin
__ 3. You have previously started some BigInsights components individually. You can,
however, start all of the components using the start-all script. Execute the following:
./start-all.sh
__ 4. Once the console starts, open Firefox. Either click the icon on the desktop or from a
command line type:
firefox &
__ 5. Start the Web Console. In the address field, specify a URL of http://ibmclass:8080.
__ 6. Log in with a User Name of biadmin and a Password of ibm2blue.
Note
After you fill in the node information, you click the Add pushbutton. That only adds the
node(s) to the list of New Nodes at the bottom of the dialog. The nodes will not be created
until you click the Save pushbutton.
__ 12. Since you are only running a single-node cluster (can you call a single node a
cluster?), you are not going to actually add a new node. So click the Cancel
pushbutton.
Note
Dashboards are unique to individual users. They cannot be viewed by another user.
This is setting numOfCores to six. This would be fine if all of the nodes in your cluster
had six cores. But what if some have six and others have eight? Indicating that the
nodes with eight cores only have six means that you are limiting the number of running
tasks. If you set numOfCores to eight, then you would be over-committing the six core
systems.
You can solve this problem using groups.
__ 12. Define two groups as follows and place them before the variable definitions.
group: grpCore6=dummyhost1
group: grpCore8=dummyhost2,ibmclass
What did this accomplish? Associated with group grpCore6 is one system with a
hostname of dummyhost1. Associated with group grpCore8 are two systems with
hostnames of dummyhost2 and ibmclass.
__ 13. Change the value of numOfCores to two.
var: numOfCores=2
__ 14. Follow that statement with
@grpCore6: numOfCores=6
@grpCore8: numOfCores=8
__ 15. Save your work and close the file.
What will this accomplish? When the mapred-site.xml file that is in the staging directory
gets pushed to a node in the cluster, the hostname of that node is checked to see if it
belongs to a group. If it belongs to grpCore6, then numOfCores is set to a value of 6. If
the hostname belongs to grpCore8, then numOfCores is set to 8. If the hostname does
not belong to any group, then the value of numOfCores remains 2.
The numOfCores value is used in the JSP formulas to arrive at the desired value for the
maximum number of mapper and reducer tasks.
__ 16. Synchronize the Hadoop configuration changes across your cluster. From a
command line type:
cd $BIGINSIGHTS_HOME/bin
./syncconf.sh hadoop
__ 17. In gedit, open mapred-site.xml that is in
File System->opt->ibm->biginsights->hadoop-conf. Note that the maximum
mapper tasks value is now 8 and the maximum reducer tasks value is now 4.
__ 18. Close mapred-site.xml.
__ 19. Again, open ibm-hadoop.properties that is in File
System->opt->ibm->biginsights->hdm->hadoop-conf-staging and move
ibmclass from grpCore8 to grpCore6.
group: grpCore6=dummyhost1,ibmclass
group: grpCore8=dummyhost2
End of exercise
Requirements
Requires the DW612 lab images.
Exercise instructions
Unfortunately, the first portion of this lab is going to be a typing exercise. The second part
uses the BigInsights workflow editor to simplify things. The goal is to create a workflow to
invoke the Word Count MapReduce application that comes as a sample application with
BigInsights.
Then, if things go well, you continue by creating a coordinator flow that invokes the created
workflow based upon a schedule.
Note
The full text of both the workflow and coordinator code appears at the end of this exercise.
__ 5. After the <start to="wordcount"/> statement, code the first action node. Since you
coded <start to="wordcount"/>, the name of this action is wordcount. It follows the
start to... statement. This action node invokes a MapReduce application and, upon
successful completion, it is to end normally. If this action node fails, it is to invoke the
kill node that you just coded.
<action name="wordcount">
<map-reduce>
</map-reduce>
<ok to="end"/>
<error to="fail"/>
</action>
__ 6. Within the <map-reduce> element, you need to specify your JobTracker and
NameNode. In this exercise, refer to your JobTracker and NameNode via variables.
(The values for the variables get specified in the job.properties file.) Since this Oozie
workflow could be scheduled to run multiple times and you most likely do not want to
have any manual intervention between runs, code a <prepare> element that deletes
your output directory. (Remember, the MapReduce job fails if the output directory
already exists.) You also need to code your configuration information. So following
<map-reduce>, code:
<job-tracker>${jobtracker}</job-tracker>
<name-node>${namenode}</name-node>
<prepare>
<delete path="${PREFIX}/${wf:user()}/myoutput"/>
</prepare>
<configuration>
</configuration>
__ 7. Within the <configuration> elements, code the properties that specify that you are
using the new MapReduce API, your MapReduce mapper class, reducer class,
output key and value type, input directory and output directory. (There are additional
properties that could be specified, like the number of map tasks, but we are not
going to worry about those now.)
Note
The wordcount application was written using the new MapReduce API so you need to
inform Oozie that you are using this new API.
<!-- These first two properties indicate the use of the new API -->
<property>
<name>mapred.mapper.new-api</name>
<value>true</value>
</property>
<property>
<name>mapred.reducer.new-api</name>
<value>true</value>
</property>
<property>
<name>mapreduce.map.class</name>
<value>org.apache.hadoop.examples.TokenizerMapper</value>
</property>
<property>
<name>mapreduce.reduce.class</name>
<value>org.apache.hadoop.examples.IntSumReducer</value>
</property>
<property>
<name>mapred.output.key.class</name>
<value>org.apache.hadoop.io.Text</value>
</property>
<property>
<name>mapred.output.value.class</name>
<value>org.apache.hadoop.io.IntWritable</value>
</property>
<property>
<name>mapred.input.dir</name>
<value>/user/${wf:user()}/GutenbergDocs</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>/user/${wf:user()}/myoutput</value>
</property>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
__ 8. Remember to save your work.
Note
If there was a desire to invoke another MapReduce application based upon the successful
completion of the Word Count application, then instead of coding <ok to="end"/> for an
action node, you might code <ok to="anotheraction"/> and then code a second action node
with a name of anotheraction that invokes your other MapReduce application. But since
you have done a good deal of typing so far, I will not subject you to coding another action
node.
__ 9. Next create a job.properties file in the ~/exercise4 directory. This is where you define
the various variables. Code the following in this file:
#JobTracker and NameNode
jobtracker=ibmclass:9001
namenode=ibmclass:9000
#prefix of the HDFS path for input and output, adapt!
PREFIX=hdfs://ibmclass:9000/user
#HDFS path where you need to copy workflow.xml and lib/*.jar to
oozie.wf.application.path=hdfs://ibmclass:9000/user/biadmin/exercise4/
#one of the values from Hadoop mapred.queue.names
queueName=default
__ 10. Save your work.
__ 11. From a command line, create a new directory under the ~/exercise4 directory called
lib.
mkdir ~/exercise4/lib
__ 12. Then copy the supplied WordCount.jar file to the ~/exercise4/lib directory.
cp /home/labfiles/DW61/examples/WordCount.jar ~/exercise4/lib
__ 13. Next from a command line, copy the exercise4 directory from the client to the
hadoop cluster. (The job.properties file actually does not have to be moved into
Hadoop but the following put statement will copy the entire directory contents.)
hadoop fs -put ~/exercise4 exercise4
Note
If you have a problem with your workflow, remember that any changes that you make must
be copied to HDFS before they can be used.
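If you want to submit the workflow from the command line, the invocation follows the same pattern as the coordinator submission near the end of this exercise; a sketch (the job.properties path matches the file created above):
cd $BIGINSIGHTS_HOME/oozie/bin
export OOZIE_URL=http://ibmclass:8280/oozie
./oozie job -run -config ~/exercise4/job.properties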
__ b. The second approach is to use the BigInsights Web Console. In Firefox, type a
URL of http://ibmclass:8080. Log in and click the Application Status tab and
then the Workflows hyperlink.
__ 16. To see your results:
hadoop fs -cat myoutput/part-r-00000
Important
If your workflow fails, you can use the Web Console to drill down into the details about your
workflow for either your map or reduce tasks. Click the Attempt ID to view the system log.
You can use <property> elements to pass parameters to the workflow.xml. Add a
property to set the workflow.xml variable, queueName, to the value default.
Properties are defined within <configuration> elements. Code the following after the
<app-path>...</app-path> element.
<configuration>
<property>
<name>queueName</name>
<value>default</value>
</property>
</configuration>
__ 22. Save your work.
__ 23. Create another file in the coord directory called coordinator.properties. (There is no
need for this file to actually be in this directory. But this is as good a place as any.)
__ 24. Code the following into this file. I will explain the file afterward. Change the
startTime value to some date in the future. Set the endTime to one month or so
later.
oozie.coord.application.path=hdfs://ibmclass:9000/user/biadmin/coord
freq=1440
startTime=2012-07-12T14:00Z (set this to a date in the future)
endTime=2012-08-01T14:00Z (set this to a date in the future)
timezone=UTC
workflowPath=hdfs://ibmclass:9000/user/biadmin/exercise4
jobtracker=ibmclass:9001
namenode=ibmclass:9000
PREFIX=hdfs://ibmclass:9000/user
__ 25. Save your work.
oozie.coord.application.path points to the directory in the Hadoop file system that has the
coordinator.xml file.
freq=1440 is one day in minutes.
startTime=... is the time to start the first execution of the workflow.
endTime=... is the time that the schedule will end.
timezone=UTC says to use what used to be referred to as Greenwich Mean Time.
workflowPath=... points to the directory in the Hadoop file system that has the
workflow.xml that is to be executed.
The jobtracker, namenode, and PREFIX variables are required by the workflow.xml.
__ 26. Next from a command line, copy the coord directory from the client to the Hadoop
cluster. (The coordinator.properties file actually does not have to be moved into
Hadoop but the following put statement will copy the entire directory contents.)
__ 27. hadoop fs -put ~/coord coord
__ 28. Execute the Oozie coordinator.
cd $BIGINSIGHTS_HOME/oozie/bin
export OOZIE_URL=http://ibmclass:8280/oozie
./oozie job -run -config ~/coord/coordinator.properties
Important
If you code something incorrectly, then you probably will get a 500 Internal Server Error,
which seems to be a catch-all error message. For instance, when testing this exercise, I
typed config and not -config. This resulted in the 500 Internal Server Error. Go figure.
__ 29. If you use the Web Console, you can see those jobs that are currently scheduled.
Return to the Web Console.
__ 30. Click the Application Status tab. Then click the Scheduled Workflows hyperlink.
Your scheduled workflow should be displayed. Click it to get additional detail
information.
You are going to use this project to see how to publish an application using BigInsights.
(Since you did not use a BigInsights project when you developed your application, you
cannot use the BigInsights development testing capabilities. That is why you are using
another project.)
__ 1. Open Eclipse using the icon on your desktop. Take the default workspace.
__ 2. Switch to the BigInsights perspective. Window->Open Perspective->Other. Select
BigInsights and click OK.
__ 3. In Eclipse, click File->Import.
__ 4. Expand General and select Existing Projects into Workspace. Click Next.
__ 5. Select the Select root directory radiobutton.
__ 6. Click the Browse pushbutton and drill down to
File System->Home->labfiles->DW61->MapReduce.
__ 7. Select LoadMaxTemp and click OK.
__ 8. Select the Copy projects into workspace checkbox.
__ 9. Click Finish.
__ g. Test your connection. If the test is successful, click OK and then Finish.
__ 13. Select Create New Application. An application name is pre-filled. Just go with that.
You can select an icon that is to be associated with your application. (We will not do
that now.)
__ 14. In Categories, type in Test. Then click Next.
__ 15. Select an Application Type of Workflow. Click Next.
__ 16. You are going to create a new single action workflow.xml file. The action type of
MapReduce is already selected. You can look at the other choices in the drop-down
box but make sure MapReduce is ultimately selected.
The process that you are going through creates an Oozie workflow. For a
MapReduce workflow, Oozie needs to know the names of the mapper and reducer
classes. Do this by adding properties.
__ 17. Click the New pushbutton.
__ 18. From the Name drop down box select mapreduce.map.class.
__ 19. Specify a value of MaxTempMapper. (That was the name given to the mapper
class.) Click OK.
__ 20. Add a second property. Click the New pushbutton.
__ 21. From the Name drop down box select mapreduce.reduce.class.
__ 22. Specify a value of MaxTempReducer. Click OK.
The input and output directories were not hardcoded in this application. You need to
specify properties so that Oozie can pass that information to the application.
__ 23. Click the New pushbutton.
__ 24. From the Name drop down box select mapred.input.dir.
__ 25. Click OK.
__ 26. Click the New pushbutton.
__ 27. From the Name drop down box select mapred.output.dir.
__ 28. Click OK.
__ 29. Click Next.
__ 30. You are going to pass two parameters to your application. Because of the input and
output directory properties that you just specified, two parameters are already listed.
__ 31. Select the inputDir parameter and click Edit.
__ 51. Click the Run hyperlink and select the LoadMaxTemp icon.
You now are presented with a graphical interface which you can use to pass runtime
parameters to the Oozie workflow.
For this learning opportunity, there is no need for you to run the application; our focus is
on the Oozie workflow that was created when you deployed the application. You can
actually view the created workflow.xml by looking in HDFS in the application directory that
is under the user directory.
<start to="wordcount"/>
<action name="wordcount">
<map-reduce>
<job-tracker>${jobtracker}</job-tracker>
<name-node>${namenode}</name-node>
<prepare>
<delete path="${PREFIX}/${wf:user()}/myoutput"/>
</prepare>
<configuration>
<property>
<name>mapred.mapper.new-api</name>
<value>true</value>
</property>
<property>
<name>mapred.reducer.new-api</name>
<value>true</value>
</property>
<property>
<name>mapreduce.map.class</name>
<value>org.apache.hadoop.examples.TokenizerMapper</value>
</property>
<property>
<name>mapreduce.reduce.class</name>
<value>org.apache.hadoop.examples.IntSumReducer</value>
</property>
<property>
<name>mapred.output.key.class</name>
<value>org.apache.hadoop.io.Text</value>
</property>
<property>
<name>mapred.output.value.class</name>
<value>org.apache.hadoop.io.IntWritable</value>
</property>
<property>
<name>mapred.input.dir</name>
<value>/user/${wf:user()}/GutenbergDocs</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>/user/${wf:user()}/myoutput</value>
</property>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
</map-reduce>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Bummer, error message[${wf:errorMessage()}]</message>
</kill>
<end name="end"/>
</workflow-app>
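A coordinator.xml consistent with the coordinator.properties file and the <configuration> element from the coordinator steps would look roughly like this sketch (the app name and xmlns version are assumptions):
<coordinator-app name="wordcount-coord" frequency="${freq}" start="${startTime}" end="${endTime}" timezone="${timezone}" xmlns="uri:oozie:coordinator:0.1">
<action>
<workflow>
<app-path>${workflowPath}</app-path>
<configuration>
<property>
<name>queueName</name>
<value>default</value>
</property>
</configuration>
</workflow>
</action>
</coordinator-app>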
End of exercise
Requirements
Requires the DW612 lab image.
Introduction
Flume is a distributed service for efficiently moving large amounts of data. The original
use case of Flume was to gather a set of log files on various machines in a cluster and
aggregate them to a centralized persistent store such as the Hadoop Distributed File
System (HDFS). It has since been re-architected and expanded to cover a much wider
variety of data.
Probably the most important ingredient for this exercise is your imagination. You are going
to have to draw upon the inner child within you. Your exercises are running on a BigInsights
cluster that has a whopping one node. So it should be obvious that you are not going to be
able to move data from one node to another. For this exercise to be somewhat meaningful,
you are going to have to imagine that, when you start multiple agents, you have multiple
nodes and that Flume is running on each of those nodes.
Note
To terminate your running agent, do a cntl-z in the console window. This is true both for
an agent that did not initialize properly due to a configuration error and for one that is
running just fine. The cntl-z does not terminate the Java process, however.
If you have a configuration error, do a cntl-z, correct your problem, and restart your agent.
You do not have to worry about killing the process. The existing process is able to reload
the configuration file.
Now start your flume agent. Override the default logging information and write
informational records to the console. Your agent's name is agent1. Note: a lot of
data will be written to the console. cntl-z will terminate the output.
bin/flume-ng agent --name agent1 --conf-file conf/flume_agent1.properties -Dflume.root.logger=INFO,console
Or
bin/flume-ng agent -n agent1 -f conf/flume_agent1.properties -Dflume.root.logger=INFO,console
__ 13. Terminate the agent's Java process. First find the process id.
ps -ef | grep flume
__ 14. Then kill the process. (If kill does not work, do a kill -9.)
kill pid or kill -9 pid
agent13 is to do the initial read of the data from files dropped into a specified directory.
__ 1. First create your directory. From a command line
mkdir ~/sourcedata
__ 2. Open a new file in your text editor.
__ 3. Define the source, sink, and channel to be used by agent13 as well as the bindings.
__ a. The source name is spoolDirSource. Its type is spooldir.
__ b. The sink name is avroSink. Its type is avro. (Remember that to pass events from
one agent to another requires avro sinks and sources.)
The hostname for binding is ibmclass and the port is 10013.
__ c. The channel name is memChannel. Its type is memory and it has a capacity of
100.
#These statements are for agent13
agent13.sources = spoolDirSource
agent13.sinks = avroSink
agent13.channels = memChannel
agent13.sources.spoolDirSource.type = spooldir
agent13.sources.spoolDirSource.spoolDir = /home/biadmin/sourcedata
agent13.sinks.avroSink.type = avro
agent13.sinks.avroSink.hostname = ibmclass
agent13.sinks.avroSink.port = 10013
agent13.channels.memChannel.type = memory
agent13.channels.memChannel.capacity = 100
agent13.sources.spoolDirSource.channels = memChannel
agent13.sinks.avroSink.channel = memChannel
__ 4. agent99 is to get its events from agent13. Since the avro sink for agent13 was
bound to ibmclass at port 10013, that implies that the avro source for agent99 will
also be bound to ibmclass at port 10013.
Also, you are going to enhance your events by adding a timestamp to the header for
each event.
Define the source, sink, and channel to be used by agent99 as well as the bindings.
__ a. The source name is avroSource. Its type is avro.
The bind parameter is ibmclass
The port is 10013
The interceptor name is ts
The interceptor type is timestamp
__ b. The sink name is avroSink. Its type is avro.
The hostname for binding is ibmclass and the port is 10099.
__ c. The channel name is memChannel. Its type is memory and it has a capacity of
100.
#These statements are for agent99
agent99.sources = avroSource
agent99.sinks = avroSink
agent99.channels = memChannel
agent99.sources.avroSource.type = avro
agent99.sources.avroSource.bind = ibmclass
agent99.sources.avroSource.port = 10013
agent99.sources.avroSource.interceptors = ts
agent99.sources.avroSource.interceptors.ts.type = timestamp
agent99.sinks.avroSink.type = avro
agent99.sinks.avroSink.hostname = ibmclass
agent99.sinks.avroSink.port = 10099
agent99.channels.memChannel.type = memory
agent99.channels.memChannel.capacity = 100
agent99.sources.avroSource.channels = memChannel
agent99.sinks.avroSink.channel = memChannel
__ 5. agent86 is to get its events from agent99. So there must be an avro source to
receive the data and the data is to be passed to an hdfs sink.
Define the source, sink, and channel to be used by agent86 as well as the bindings.
__ a. The source name is avroSource. Its type is avro.
The bind parameter is ibmclass
The port is 10099
__ b. The sink name is hdfsSink. Its type is hdfs.
A portion of the hdfs path is created by extracting date and time information from
the header of each event.
hdfs://ibmclass:9000/user/biadmin/flume/%y-%m-%d/%H%M
The filePrefix is Log.
The writeFormat is Text
The fileType is DataStream
__ c. The channel name is memChannel. Its type is memory and it has a capacity of
100.
#These statements are for agent86
agent86.sources = avroSource
agent86.sinks = hdfsSink
agent86.channels = memChannel
agent86.sources.avroSource.type = avro
agent86.sources.avroSource.bind = ibmclass
agent86.sources.avroSource.port = 10099
agent86.sinks.hdfsSink.type = hdfs
agent86.sinks.hdfsSink.hdfs.path = hdfs://ibmclass:9000/user/biadmin/flume/%y-%m-%d/%H%M
agent86.sinks.hdfsSink.hdfs.filePrefix = Log
agent86.sinks.hdfsSink.hdfs.writeFormat = Text
agent86.sinks.hdfsSink.hdfs.fileType = DataStream
agent86.channels.memChannel.type = memory
agent86.channels.memChannel.capacity = 100
agent86.sources.avroSource.channels = memChannel
agent86.sinks.hdfsSink.channel = memChannel
__ 6. Save your work into File System->opt->ibm->biginsights->flume->conf and call
the file flume_agents.properties.
agent13
__ 7. Open a terminal window. Change to the flume directory.
cd $BIGINSIGHTS_HOME/flume
__ 8. When you start agent13, even though you coded your configuration statements
correctly, you will see what looks like a Java exception. That is because the avro
sink is not able to connect to the source yet. Once agent99
starts and the avro source does its bind, you should see a statement something like
the following:
13/08/22 17:25:34 INFO sink.AvroSink: Avro sink avroSink: Building RpcClient with hostname: ibmclass, port: 10013
Execute the following:
bin/flume-ng agent -n agent13 -f conf/flume_agents.properties -Dflume.root.logger=INFO,console
agent99
__ 9. Open a terminal window. Change to the flume directory.
cd $BIGINSIGHTS_HOME/flume
__ 10. When you start agent99, even though you coded your configuration statements
correctly, you will see what looks like a Java exception. That is because the avro
sink is not able to connect to the source yet. Once agent86
starts and the avro source does its bind, you should see a statement something like
the following:
13/08/22 17:26:44 INFO sink.AvroSink: Avro sink avroSink: Building RpcClient with hostname: ibmclass, port: 10099
Execute the following:
bin/flume-ng agent -n agent99 -f conf/flume_agents.properties -Dflume.root.logger=INFO,console
agent86
__ 11. Open a terminal window. Change to the flume directory.
cd $BIGINSIGHTS_HOME/flume
__ 12. Start agent86. You should see a statement as follows:
13/08/22 17:26:42 INFO source.AvroSource: Avro source avroSource started.
Execute the following:
bin/flume-ng agent -n agent86 -f conf/flume_agents.properties -Dflume.root.logger=INFO,console
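With all three agents running, you can exercise the chain by dropping a file into the directory that agent13 spools. Any text file works; the walden.txt path below simply reuses the copy from earlier in this guide. The events should then appear in HDFS under the date/time path configured for the hdfs sink:
cp /home/labfiles/DW61/GutenbergDocs/walden.txt ~/sourcedata
hadoop fs -ls flume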
End of exercise
Back page