Front cover

InfoSphere BigInsights
Foundation
(Course code DW612)

Instructor Exercises Guide


ERC 2.0


Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International
Business Machines Corp., registered in many jurisdictions worldwide.
The following are trademarks of International Business Machines Corporation, registered in
many jurisdictions worldwide:
BigInsights, Cognos, DB2, GPFS, Guardium, Informix, InfoSphere, Notes, Power, and WebSphere

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or
both.
Windows is a trademark of Microsoft Corporation in the United States, other countries, or
both.
UNIX is a registered trademark of The Open Group in the United States and other
countries.
Java and all Java-based trademarks and logos are trademarks or registered trademarks
of Oracle and/or its affiliates.
VMware and the VMware "boxes" logo and design, Virtual SMP and VMotion are registered
trademarks or trademarks (the "Marks") of VMware, Inc. in the United States and/or other
jurisdictions.
Netezza is a trademark or registered trademark of IBM International Group B.V., an IBM
Company.
Other product and service names might be trademarks of IBM or other companies.

September 2013 edition


The information contained in this document has not been submitted to any formal IBM test and is distributed on an "as is" basis without any warranty, either express or implied. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk.

Copyright International Business Machines Corporation 2012, 2013.


This document may not be reproduced in whole or in part without the prior written permission of IBM.
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.


Contents
Trademarks
Instructor exercises overview
Exercises configuration
Exercises description

Exercise 1. Exploring Apache Hadoop
    Section 1: Configure your image
    Section 2: Start BigInsights components
    Section 3: Using the Hadoop command line interface
    Section 4: Running a sample MapReduce program from the command line
    Section 5: Using the Hadoop web interface to check startup status
    Section 6: Work with the BigInsights Web Console
    Section 7: Cluster Status
    Section 8: Files
    Section 9: Application Status

Exercise 2. MapReduce
    Part 1: Basic MapReduce development
        Section 1: Define a Java project in Eclipse
        Section 2: Create a Java package and the mapper class
        Section 3: Complete the mapper
        Section 4: Create the reducer class
        Section 5: Complete the reducer
        Section 6: Create the driver
        Section 7: Create a JAR file
        Section 8: Running your application
        Section 9: Add a combiner function
        Section 10: Recreate the JAR file
        Section 11: Running your application again
    Part 2: MapReduce with the BigInsights development environment
        Section 1: Create the MapReduce templates

Exercise 3. Hadoop Configuration
    Section 1: Cluster Status
    Section 2: Cluster Administration
    Section 3: The Dashboard
    Section 4: Configuration changes

Exercise 4. Controlling Workloads with Oozie
    Section 1: Start the BigInsights components
    Section 2: Create an Oozie workflow
    Section 3: Working with an Oozie coordinator
    Section 4: Load an existing BigInsights project
    Section 5: Publish your application
    Section 6: Deploy and run your application
    Section 7: Oozie Coordination editor
    Section 8: Coding application dependencies setup
    Section 9: Create a linked application
    Section 10: Running a linked application
    Section 11: Workflow code
    Section 12: Coordinator code

Exercise 5. Configuring Flume for Data Loading
    Section 1: Creating a configuration file for basic testing
    Section 2: Test your first agent
    Section 3: A more complicated configuration overview
    Section 4: Setup agent13
    Section 5: Test your configuration
    Section 6: Transfer data


Trademarks
The reader should recognize that the following terms, which appear in the content of this
training document, are official trademarks of IBM or other companies:
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International
Business Machines Corp., registered in many jurisdictions worldwide.
The following are trademarks of International Business Machines Corporation, registered in
many jurisdictions worldwide:
BigInsights, Cognos, DB2, GPFS, Guardium, Informix, InfoSphere, Notes, Power, and WebSphere

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or
both.
Windows is a trademark of Microsoft Corporation in the United States, other countries, or
both.
UNIX is a registered trademark of The Open Group in the United States and other
countries.
Java and all Java-based trademarks and logos are trademarks or registered trademarks
of Oracle and/or its affiliates.
VMware and the VMware "boxes" logo and design, Virtual SMP and VMotion are registered
trademarks or trademarks (the "Marks") of VMware, Inc. in the United States and/or other
jurisdictions.
Netezza is a trademark or registered trademark of IBM International Group B.V., an IBM
Company.
Other product and service names might be trademarks of IBM or other companies.


Instructor exercises overview


Exercises depend on the successful completion of previous exercises, where noted in the exercise.


Exercises configuration
Each student has a separate system and students work
independently.


Exercises description
In the exercise instructions, you can check off the line before each
step as you complete it to track your progress.
Most exercises include required sections which should always be
completed. It might be necessary to complete these sections before
you can start later exercises. Some exercises might also include
optional sections that you might want to complete if you have sufficient
time and want an extra challenge.


Exercise 1. Exploring Apache Hadoop


Estimated time
0:30

What this exercise is about


The student will work with Hadoop via the command line, where the student will list directories, create directories, and upload files to HDFS. Then the student will invoke a MapReduce job from the command line and view cluster information using a web browser.

What you should be able to do


At the end of this exercise, you should be able to:
- List the contents of an HDFS directory
- Create an HDFS directory
- Upload data from the local system to an HDFS directory
- Invoke a MapReduce job from the command line
- View Hadoop cluster information via a web browser

Requirements
Requires the DW612 lab images.


Exercise instructions
Preface
Log in to your image with a userid of biadmin and a password of ibm2blue. If you need to
do something with root, the password is dalvm3.

Section 1: Configure your image


As copies are made of the VMware image, additional network devices get defined and the
IP address changes. Configuration changes are required to get BigInsights to work.
__ 1. In the lower left corner, click Computer.
__ 2. In the list on the right, under System, select YaST. When prompted for a password,
specify dalvm3.
__ 3. In the list on the left side, click Network Devices.
__ 4. Under Network Devices, select Network Settings.
__ 5. If you see two Ethernet controller entries, select the one that is configured as DHCP.
__ 6. Click the Delete pushbutton.
__ 7. The other entry (Not configured) should now be selected. Click the Edit pushbutton.
__ 8. Click Next and then OK.
__ 9. Close the YaST Control Center
__ 10. Open a terminal window. Right-click the desktop's background and choose Open in Terminal.
__ 11. Switch user to root. Specify a password of dalvm3.
su
__ 12. Check the IP address.
ifconfig
__ 13. Note your IP address. ___________________
__ 14. Check your hostname
hostname
__ 15. Edit the hosts file.
gedit /etc/hosts
__ 16. Add your hostname and IP address or if one exists, modify it. For example:
192.168.70.160 ibmclass
__ 17. Save your work and close gedit.
__ 18. Exit from root.
exit

Section 2: Start BigInsights components


You will not need to start all BigInsights components to do the exercises for this course.
You will start the needed components at the appropriate time. Initially you will only need to
start Hadoop.
__ 1. From the command line, change the directory to the BigInsights bin directory.
cd $BIGINSIGHTS_HOME/bin
__ 2. Start Hadoop. If you are prompted for a password, specify ibm2blue and click OK.
./start.sh hadoop

Section 3: Using the Hadoop command line interface


Use the command line approach and invoke the file system (fs) shell using the format:
hadoop fs <args>.
__ 3. In a command window (you should still have one open) key the following command
to list the Hadoop root directory.
hadoop fs -ls /
__ 4. Each individual user has a home directory, named after their user name. You can
see the contents of the /user directory by typing the command:
hadoop fs -ls /user
__ 5. Create a subdirectory called GutenbergDocs in your user directory. Type in the
command:
hadoop fs -mkdir GutenbergDocs
__ 6. Then list your home directory. Note that the directory that you just created defaulted to your home directory. This is like Linux. If you wanted the directory elsewhere, you would need to fully qualify the path.
hadoop fs -ls
__ 7. Next, upload to the Hadoop system all of the files that are in the /home/labfiles/DW61/GutenbergDocs directory.
hadoop fs -copyFromLocal /home/labfiles/DW61/GutenbergDocs/*.*
GutenbergDocs
Note
Instead of doing a mkdir and then copying all files from the GutenbergDocs local directory,
you could have done it all in one step.
hadoop fs -copyFromLocal /home/labfiles/DW61/GutenbergDocs /user/biadmin

would have created the GutenbergDocs directory in HDFS and copied the files from the
local file system to HDFS.

__ 8. List the GutenbergDocs directory.


hadoop fs -ls GutenbergDocs
__ 9. Note the number displayed before the userid biadmin. In this case it is a 1. This is
the replication factor.
In the presentation material, it was stated that the default replication factor is 3. But this
shows that for these files, the replication factor is one. Why is that?
There is only one node in this cluster so that is taken into account and the system set
the replication factor to one. A factor of three, under these circumstances just does not
make sense.
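If you want to see more detail about how these files are stored, including block and replication information, the standard Hadoop fsck utility can report it (optional; not required for this exercise):
hadoop fsck /user/biadmin/GutenbergDocs -files -blocks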

Section 4: Running a sample MapReduce program from the command line
__ 10. If you navigate on the client system (not in Hadoop) to the
/home/labfiles/DW61/examples directory, you will find a file, WordCount.jar. This is
the Hadoop MapReduce "Hello World" equivalent. This application reads a text file
and counts the occurrences of each word in that file.
ls /home/labfiles/DW61/examples/*.*
__ 11. Change to the directory that contains the JAR file.
cd /home/labfiles/DW61/examples
__ 12. Execute this MapReduce program specifying the walden.txt file that you just copied
to the HDFS as input and output the results to a directory called countout. From the
command line execute the following:
hadoop jar WordCount.jar org.apache.hadoop.examples.WordCount
/user/biadmin/GutenbergDocs/walden.txt /user/biadmin/countout
Note
It is important to note that the output directory of a MapReduce program must not exist when the job is run. If the output directory does exist, the job will fail.
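If you need to rerun a job, simply remove the previous output directory first. For example (this uses the same fs shell -rmr option that appears later in this exercise):
hadoop fs -rmr /user/biadmin/countout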

__ 13. Monitor the logging information. When the job completes, view the results:
hadoop fs -ls countout
The part-r-00000 file is the one that you want to view.

hadoop fs -cat countout/part-r-00000


This is just a simple word count program where each token is terminated by a space. Hence "vale", "vale," and "vale." are all treated as different words.
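If you wanted those variants counted as the same word, the mapper would have to normalize each token before emitting it. Here is a minimal Java sketch of such normalization, purely for illustration (this is not part of the supplied WordCount.jar):
public class TokenNormalizer {
    // Lowercase the token and strip leading/trailing punctuation so that
    // "vale", "vale," and "vale." all map to the same key.
    static String normalize(String token) {
        return token.toLowerCase().replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
    }

    public static void main(String[] args) {
        System.out.println(normalize("vale."));  // prints: vale
        System.out.println(normalize("Vale,"));  // prints: vale
    }
}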

Section 5: Using the Hadoop web interface to check startup status


Hadoop comes with several web interfaces which are by default (see
conf/hadoop-default.xml) available at these locations:
- http://localhost:50070/ web UI for HDFS namenode(s)
- http://localhost:50030/ web UI for MapReduce JobTracker
- http://localhost:50060/ web UI for TaskTracker(s)
These web interfaces provide concise information about what's happening in your
Hadoop environment.

HDFS NameNode web interface


The namenode web UI shows you a cluster summary including information about
total/remaining capacity, live and dead nodes. Additionally, it allows you to browse the
HDFS namespace and view the contents of its files in the web browser. It also gives
access to the local machine's Hadoop log files (the machine on which the web UI is
running.)
__ 14. Open Firefox. (There should be an icon on your desktop or from a command line
execute firefox &).
__ 15. In the newly opened web browser, type in the address http://ibmclass:50070. View the Cluster Summary information.
__ 16. Click the Browse the filesystem hyperlink.
__ 17. You can drill down on any of the hyperlinks in the Name column. Do so for user.
__ 18. Return to the Cluster Summary page (click the Go back to DFS home hyperlink)
__ 19. Click the Namenode logs hyperlink. This is a good place to view the logs.

MapReduce jobtracker web interface


The job tracker web UI provides information about general job statistics of the Hadoop
cluster, running/completed/failed jobs and a job history log file. It also gives access to
the local machine's Hadoop log files.
__ 20. In the web browser address field, type in http://localhost:50030.
__ 21. Here you can view any running or completed jobs. You can drill down on any jobid to
get detailed information. (You should see your wordcount job.) Drill down on it and
view the statistics.
__ 22. At the bottom of the page, click the Go back to JobTracker hyperlink.

__ 23. The Log hyperlink allows you to view the JobTracker's log files.
__ 24. The Job Tracker History hyperlink displays jobs that this JobTracker previously ran.
__ 25. Back on the JobTracker administration page, in the upper right corner, are some quick links to navigate the JobTracker.

TaskTracker web interface


The TaskTracker web UI shows running and non-running tasks. It also gives
access to the local machine's Hadoop log files.
__ 26. In the web browser address field, type in http://localhost:50060. You can view tasks running on this TaskTracker.
__ 27. The Log hyperlink allows you to view this TaskTracker's log files.
__ 28. You can close the web browser.

Section 6: Work with the BigInsights Web Console


To use the web console you need to start the BigInsights component, console. Derby also
needs to be started to use some of the capabilities of the web console.
__ 1. Open a terminal window (or use one that you already have open). Right-click the desktop's background and choose Open in Terminal.
__ 2. Change directory to the BigInsights bin directory.
cd $BIGINSIGHTS_HOME/bin
__ 3. Start the desired components. Execute the following:
./start.sh console derby
__ 4. Once the console starts, return to Firefox.
__ 5. Start the Web Console. In the address field, specify a URL of http://ibmclass:8080.
(There is also a bookmark in the Firefox toolbar that can be used.)
__ 6. Log in with a User Name of biadmin and a Password of ibm2blue.

Section 7: Cluster Status


__ 7. Click the Cluster Status tab. Most of the components are not started, in particular
HttpFS. This component is required for what you are to do next.
__ 8. In the list on the left, select HttpFS. On the right side, click the Start pushbutton.
After the page refreshes, you should see that it has been started.

Section 8: Files
__ 9. Click the Files tab.
__ 10. On the left side, expand hdfs://localhost.localdomain:9000/->user->biadmin.

__ 11. Select biadmin. The icons at the top of this frame are now enabled.
__ 12. Create a new directory under biadmin. Select the Create Directory icon.
__ 13. Type in a name of exercise1. Click OK. The newly created directory should now be
selected.
__ 14. Click the Upload icon.
__ 15. Click Browse. Navigate to File System->/home/labfiles/DW61/GutenbergDocs.
__ 16. Select walden.txt and click Open. Note that the file is added to a list. If you wanted
to upload additional files, browse and select them. They are added to the list. The
files in the list will be uploaded when you click OK. So click OK.
__ 17. Expand the exercise1 directory. If the uploaded file does not appear, then right-click
exercise1 and select Refresh.
__ 18. Select walden.txt. The selected file is displayed on the right side.
__ 19. Select the Rename icon.
__ 20. Rename the file to waldenpond.txt and click OK.
__ 21. Select the Delete icon. Click Yes to remove the file.
__ 22. Click the black box icon (to the left of the refresh icon). This will open a Hadoop fs
shell.
__ 23. In the input area, type the following. Then click Submit.
hadoop fs -rmr exercise1
__ 24. Verify that the directory was removed. In the shell dialog type:
hadoop fs -ls
__ 25. Close the shell dialog.
__ 26. The exercise1 directory might still be listed. Select the Refresh icon.

Section 9: Application Status


__ 27. Click the Application Status tab. You may get an error, "Unable to retrieve data from the server." That is OK. Close that message.
__ 28. Click the Jobs hyperlink. The word count application that you ran earlier is listed.
__ 29. Select the word count application and you see detail task information.
__ 30. If you keep drilling down, you eventually get to the log information.
__ 31. Log out of the Web Console and close Firefox.

End of exercise

Exercise 2. MapReduce
Estimated time
1:00

What this exercise is about


In this exercise, the student will first develop, using a text editor, a MapReduce application that finds the highest average monthly temperature.
Then the student will use the MapReduce wizard in the BigInsights
Eclipse Development environment to create the basics for the
application. The student can then compare the benefits of using the
development environment.

What you should be able to do


At the end of this exercise, you should be able to:
- Use the BigInsights Eclipse MapReduce wizard to generate a
MapReduce template
- Monitor the execution of a MapReduce job using the
BigInsights web console
- Export a JAR file for your MapReduce application

Requirements
Requires the DW612 lab images.


Exercise instructions
You will use the BigInsights Eclipse development environment to create a MapReduce application. You have a file that contains the average monthly temperature from the three reporting stations in Sumner County, Tennessee for the year 2012. Your goal is to find the
maximum average temperature for each month in 2012. The mapper and reducer will emit
key / value pairs that contain the month and the temperature. (In this data, the temperature
values are three-digit integers. They are the average value multiplied by 10.)

Upload your data to Hadoop


Hadoop should still be running. If it is not, then you will have to restart it.
__ 1. From a command line execute:
hadoop fs -mkdir TempData
__ 2. Upload some temperature data from the local file system.
hadoop fs -copyFromLocal /home/labfiles/DW61/SumnerCountyTemp.dat
/user/biadmin/TempData
__ 3. You can view this data from the Files tab in the Web Console or by executing the
following command. The values in the 95th column (354, 353, 353, 353, 352, ...) are the average daily temperatures. They are the result of multiplying the actual average temperature value by 10. (That way you don't have to worry about working with decimal points.)
hadoop fs -cat TempData/SumnerCountyTemp.dat

Part 1: Basic MapReduce development


Section 1: Define a Java project in Eclipse
__ 1. Start Eclipse using the icon on the desktop.
__ 2. Make sure that you are using the Java perspective. Click Window->Open
Perspective->Other. Then select Java (default). Click OK.
__ 3. Create a Java project. Select File->New->Java Project.
__ 4. Specify a Project name of MaxTemp. Click Finish.
__ 5. Right-click the MaxTemp project, scroll down and select Properties.
__ 6. Select Java Build Path.
__ 7. In the Properties dialog, select the Libraries tab.
__ 8. Click the Add Library pushbutton.
__ 9. Select BigInsights Libraries and click Next. Then click Finish. Then click OK.


Section 2: Create a Java package and the mapper class


__ 10. In the Package Explorer expand MaxTemp and right click src. Select
New->Package.
__ 11. Type a Name of com.some.company.dw61. Click Finish.
__ 12. Right-click com.some.company.dw61 and select New->Class.
__ 13. Type in a Name of MaxTempMapper. It will be a public class. Click Finish.
The data type for the input key to the mapper will be LongWritable. The data itself will
be of type Text. The output key from the mapper will be of type Text. And the data from
the mapper (the temperature) will be of type IntWritable.
__ 14. Your class:
__ a. You will need to import java.io.IOException.
__ b. Extend Mapper<LongWritable, Text, Text, IntWritable>
__ c. Define a public method called map.
__ d. Your code should look like the following:
package com.some.company.dw61;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTempMapper extends
        Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
    }
}

Section 3: Complete the mapper


You are reading in a line of data. You will want to convert it to a string so that you can do
some string manipulation. You will want to extract the month and average temperature
for each record.
The month begins at the 22nd character of the record (zero offset) and the average temperature begins at the 95th character. (Remember that the average temperature value is three digits.) For example, a record with "07" at offsets 22-23 and "354" at offsets 95-97 should cause the mapper to emit the pair ("07", 354).

__ 15. In the map method, add the following code (or whatever code you think is required):
String line = value.toString();
String month = line.substring(22,24);
int avgTemp;
avgTemp = Integer.parseInt(line.substring(95,98));
context.write(new Text(month), new IntWritable(avgTemp));
__ 16. Save your work.

Section 4: Create the reducer class


__ 17. In the Package Explorer right-click com.some.company.dw61 and select
New->Class.
__ 18. Type in a Name of MaxTempReducer. It will be a public class. Click Finish.
The data type for the input key to the reducer will be Text. The data itself will be of type
IntWritable. The output key from the reducer will be of type Text. And the data from the
reducer will be of type IntWritable.
__ 19. Your class:
__ a. You will need to import java.io.IOException.
__ b. Extend Reducer<Text, IntWritable, Text, IntWritable>
__ c. Define a public method called reduce.
__ d. Your code should look like the following:
package com.some.company.dw61;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MaxTempReducer extends
Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context
context)
throws IOException, InterruptedException {
}
}

Section 5: Complete the reducer


For the reducer, you want to iterate through all values for a given key. For each value,
check to see if it is higher than any of the other values.
__ 20. Add the following code (or your variation) to the reduce method.

int maxTemp = Integer.MIN_VALUE;
for (IntWritable value : values) {
    maxTemp = Math.max(maxTemp, value.get());
}
context.write(key, new IntWritable(maxTemp));
__ 21. Save your work.

Section 6: Create the driver


__ 22. In the Package Explorer right-click com.some.company.dw61 and select
New->Class.
__ 23. Type in a Name of MaxMonthTemp. It will be a public class. Click Finish.
The GenericOptionsParser() will extract any input parameters that are not system
parameters and place them in an array. In your case, two parameters will be passed to
your application. The first parameter is the input file. The second parameter is the
output directory. (This directory must not exist or your MapReduce application will fail.)
Your code should look like this:
package com.some.company.dw61;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import com.some.company.dw61.MaxTempReducer;
import com.some.company.dw61.MaxTempMapper;

public class MaxMonthTemp {


public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] programArgs =
new GenericOptionsParser(conf, args).getRemainingArgs();
if (programArgs.length != 2) {
System.err.println("Usage: MaxTemp <in> <out>");
System.exit(2);
}
Job job = new Job(conf, "Monthly Max Temp");
job.setJarByClass(MaxMonthTemp.class);
job.setMapperClass(MaxTempMapper.class);
job.setReducerClass(MaxTempReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(programArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(programArgs[1]));
// Submit the job and wait for it to finish.
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
__ 24. Save your work.

Section 7: Create a JAR file


__ 25. In the Package Explorer, expand the MaxTemp project. Right-click src and select
Export.
__ 26. Expand Java and select JAR file. Click Next.
__ 27. Click the JAR file browse pushbutton. Type in a name of MyMaxTemp.jar. Keep
biadmin as the folder. Click OK.
__ 28. Click Finish.

Section 8: Running your application


__ 29. Open a command line or return to one if it is already open.
__ 30. Change to biadmin's home directory.
cd ~
__ 31. Execute your program. At the command line type:
hadoop jar MyMaxTemp.jar com.some.company.dw61.MaxMonthTemp
/user/biadmin/TempData/SumnerCountyTemp.dat /user/biadmin/TempDataOut

Section 9: Add a combiner function


__ 32. Return to your Eclipse development environment. You are going to add a combiner function to your application. In a real-world multi-node cluster, this would allow some reducing to take place on the mapper node and lessen the amount of network traffic. (Reusing the reducer as the combiner works here because taking a maximum is associative and commutative.)

__ 33. Look at the code for MaxMonthTemp.java. Add the following statement after the job.setMapperClass(MaxTempMapper.class); statement.
job.setCombinerClass(MaxTempReducer.class);
__ 34. Save your work.

Section 10: Recreate the JAR file


__ 35. In the Project Explorer, expand the MaxTemp project. Right-click src and select
Export.
__ 36. Expand Java and select JAR file. Click Next.
__ 37. Click the JAR file browse pushbutton. Type in a name of MyMaxTemp.jar. Keep
biadmin as the folder. Click OK.
__ 38. Click Finish. Overwrite and replace the previous JAR.

Section 11: Running your application again


__ 39. Execute your program. At the command line, type the following. Note that the output directory has changed.
hadoop jar MyMaxTemp.jar com.some.company.dw61.MaxMonthTemp
/user/biadmin/TempData/SumnerCountyTemp.dat /user/biadmin/TempDataOut2
__ 40. If you compare the statistics of the two runs, you will see for the first run Combine
output records=0 whereas the second run has Combine output records=12.
__ 41. Close the open edit windows in Eclipse.

Part 2: MapReduce with the BigInsights development environment


You just went through the process of creating a MapReduce application. It was relatively
easy since the code was provided. But what if the exercise just said to write a MapReduce application, with no code provided? Where would you start?
In this part of the exercise you will use the BigInsights development environment to get you
started in this endeavor.

Section 1: Create the MapReduce templates.


__ 1. If you closed Eclipse, then start Eclipse by double-clicking on the Eclipse icon on
your desktop. When you are asked for a workspace, click OK.
__ 2. Select the BigInsights Text Analytics Workflow perspective. From the menubar,
select Window->Open Perspective->Other. Choose BigInsights and click OK.
__ 3. If the BigInsights Task Launcher is no longer open, then click Help->Task launcher for Big Data.
__ 4. In the BigInsights Task Launcher, the Overview tab should be selected. Click Create
a new BigInsights project.
__ 5. Type in a project name of BIMaxTemp. Click Finish.

__ 6. In the BigInsights Task Launcher, click the Develop tab.


__ 7. Select Create a BigInsights program.
__ 8. Click the Java MapReduce Program radiobutton then click OK. (This is going to
accomplish the same thing as right-clicking on MaxTemp->New->Other. Then
under BigInsights, selecting Java MapReduce Program).
__ 9. For the Source folder, click the Browse pushbutton, expand BIMaxTemp, select src,
and click OK.
__ 10. Type a Package of com.ibm.dw61.
__ 11. For Name, type MaxTempMapper.
The data type for the input key to the mapper will be LongWritable. The data itself will
be of type Text. The output key from the mapper will be of type Text. And the data from
the mapper (the temperature) will be of type IntWritable.
__ 12. For Type of input key, select the Browse pushbutton. Type in LongWritable. In the displayed list, select LongWritable - org.apache.hadoop.io. Click OK.
__ 13. For Type of input value, select the Browse pushbutton. Type in Text. In the
displayed list, select Text - org.apache.hadoop.io. Click OK.
__ 14. For Type of output key, select the Browse pushbutton. Select Text in the Matching
items list. Click OK.
__ 15. For Type of output values, select the Browse pushbutton. Type in IntWritable. In the displayed list, select IntWritable - org.apache.hadoop.io. Click OK.
__ 16. Click Next.
__ 17. For the name of the reducer class, type in MaxTempReducer.
__ 18. For Type of output key, select the Browse pushbutton. Select Text in the displayed
list. Click OK.
__ 19. For Type of output values, select the Browse pushbutton. In the displayed list, select IntWritable - org.apache.hadoop.io. Click OK.
__ 20. Click Next.
__ 21. For the main class, type in a Package of com.ibm.dw61.
__ 22. For Name type MaxMonthlyTemp. Then click Finish.
__ 23. Look at the code for the three Java classes that were generated. All you need to do is add the code that is specific to your MapReduce application. You can see that using this BigInsights wizard greatly simplifies generating a MapReduce application.
To speed things up, you are not going to actually add any code to these templates. I
just wanted you to see the benefits of using BigInsights.
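For reference, the generated mapper stub looks much like the skeleton you typed by hand in Part 1, just in the com.ibm.dw61 package; roughly along these lines (details can vary slightly with the wizard version):
package com.ibm.dw61;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTempMapper extends
        Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Add your record-parsing logic here, as in Part 1, Section 3.
    }
}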
__ 24. Close the open edit windows.


End of exercise


Exercise 3. Hadoop Configuration


Estimated time
0:30

What this exercise is about


The student will modify some Hadoop configuration parameters and
synchronize the changes in the cluster.

What you should be able to do


At the end of this exercise, you should be able to:
- Describe the use of the syncconf.sh command
- Explain how to use variables and groups with configuration files
- Use the BigInsights Web Console to administer the Hadoop
cluster

Requirements
Requires the DW612 lab images.


Exercise instructions
Start all BigInsights components.
__ 1. Open a terminal window. Right-click the desktop's background and choose Open in Terminal.
__ 2. Change directory to the BigInsights bin directory.
cd $BIGINSIGHTS_HOME/bin
__ 3. You have previously started some BigInsights components individually. You can,
however, start all of the components using the start-all script. Execute the following:
./start-all.sh
__ 4. Once the console starts, open Firefox. Either click the icon on the desktop or from a
command line type:
firefox &
__ 5. Start the Web Console. In the address field, specify a URL of http://ibmclass:8080.
__ 6. Log in with a User Name of biadmin and a Password of ibm2blue.

Section 1: Cluster Status


__ 7. Click the Cluster Status tab. It might take a couple of seconds before the various
components are displayed and their statuses have been updated.
Since you started all of the components from the command line, you should see that they are all running, except for the Monitoring component. This one is unique: the first time, you have to start it from the console or start it explicitly from the command line. After that, when you do a start-all, the Monitoring component starts with the others. (It also works out nicely for education purposes so that you get the chance to start a component from the console.)
__ 8. Start Monitoring. Select it from the list on the left side. Then on the right side, click
the Start pushbutton. After 10 seconds or so, you should see that it has been
started.

Section 2: Cluster Administration


__ 9. On the top of the left side, click Nodes. This will allow you to easily view the status of
each node in the cluster and the roles that each node is playing.
__ 10. On the right side, click the Add nodes pushbutton. This will begin the process of
adding a new node(s) to the cluster.
__ 11. From the Service drop-down box, choose the use for this node. You may add multiple nodes at once by specifying a Start IP address and then some number of nodes.


Note
After you fill in the node information, you click the Add pushbutton. That will then only add
the node(s) to the list of New Nodes at the bottom of the dialog. The nodes will not be
created until you click the Save pushbutton.

__ 12. Since you are only running a single-node cluster (can you call a single node a cluster?), you are not going to actually add a new node. So click the Cancel pushbutton.

Section 3: The Dashboard


__ 13. Click the Dashboard tab.
__ 14. Click the Select dashboard drop-down list. You can see the three built-in
dashboards listed. Select Data Services. After maybe fifteen seconds each widget
should display some data.
__ 15. Note that this dashboard has multiple tabs. Select the Zookeeper tab just to see
what is displayed there.
__ 16. You can create your own dashboards as well. Click the New Dashboard icon, which is the third icon to the right of the drop-down box. (Move your cursor over the three icons and hover text will appear.)
__ 17. Call your new dashboard MyDashboard. Click OK.
__ 18. Click the NewTab text in the displayed tab. Now you can change that to something
descriptive.
__ 19. Click the Add Widget pushbutton.
__ 20. From the widget list, select Add Widget under the Total Number of Available Hosts in the Cluster widget. Scroll down and select Add Widget under the Total Number of Files in HDFS widget. Then close the widget selection panel.
__ 21. You can grab a widget and rearrange it on the dashboard.
__ 22. Click the Configure icon for the Total Number of Files in HDFS widget. (It looks like a green gear.) Here you can modify the configuration of that widget. Close the configuration panel.
__ 23. Clicking the green plus sign icon next to your dashboard's tab creates a new tab for this dashboard.
__ 24. Click the Save icon to the right of the drop-down box.
__ 25. If you select the drop-down box, you can see that your new dashboard is now listed.


Note
Dashboards are unique to individual users. They cannot be viewed by another user.

__ 26. Log out of the Web Console and close Firefox.

Section 4: Configuration changes


When configuration changes are made to Hadoop, those changes must be distributed to all of the nodes in the cluster. BigInsights supplies a syncconf.sh script to do this.
In BigInsights, changes are not made directly to the configuration files. Instead, the changes are made to files in a staging directory. The changes to the staging files are then processed and distributed to the appropriate nodes in the cluster.
__ 1. Stop all running BigInsights components. From a command line
cd $BIGINSIGHTS_HOME/bin
./stop-all.sh
__ 2. First look at the current configuration values for some MapReduce properties.
__ 3. From a command line execute
gedit &
__ 4. Click the Open icon and navigate to
File System->opt->ibm->biginsights->hadoop-conf and open mapred-site.xml.
__ 5. Note that the value of mapred.tasktracker.map.tasks.maximum is 6 and the value of
mapred.tasktracker.reduce.tasks.maximum is 3.
The question is, How were those values set?
__ 6. Close mapred-site.xml.
__ 7. Next, navigate to File System->opt->ibm->biginsights->hdm->hadoop-conf-staging and open mapred-site.xml.
__ 8. If you look at the values for mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum, you see some sort of JSP formula.
numOfCores is a variable that is set in ibm-hadoop.properties.
__ 9. Close mapred-site.xml.
__ 10. Open ibm-hadoop.properties. It is in the same hadoop-conf-staging directory.
__ 11. Scroll to the bottom of the file and you can see where numOfCores is defined.
var: numOfCores=6


This is setting numOfCores to six. This would be fine if all of the nodes in your cluster
had six cores. But what if some have six and others have eight? Indicating that the
nodes with eight cores only have six means that you are limiting the number of running
tasks. If you set numOfCores to eight, then you would be over-committing the six core
systems.
You can solve this problem using groups.
__ 12. Define two groups as follows and place them before the variable definitions.
group: grpCore6=dummyhost1
group: grpCore8=dummyhost2,ibmclass
What did this accomplish? Associated with group, grpCore6, is one system with a
hostname of dummyhost1. Associated with group, grpCore8, are two systems with
hostnames of dummyhost2 and ibmclass.
__ 13. Change the value for the numOfCores to two
var: numOfCores=2
__ 14. Follow that statement with
@grpCore6: numOfCores=6
@grpCore8: numOfCores=8
__ 15. Save your work and close the file.
What will this accomplish? When the mapred-site.xml file that is in the staging directory gets pushed to a node in the cluster, the hostname of that node is checked to see if it belongs to a group. If it belongs to grpCore6, then numOfCores is set to a value of 6. If the hostname belongs to grpCore8, then numOfCores will be set to 8. If the hostname does not belong to any group, then the value of numOfCores remains 2.
The numOfCores value is used in the JSP formulas to arrive at the desired value for the
maximum number of mapper and reducer tasks.
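Putting steps 12 through 14 together, the end of ibm-hadoop.properties should now contain the following lines, with the group definitions placed before the variable definitions (so ibmclass resolves to numOfCores=8, dummyhost1 to 6, and any other host to the default of 2):
group: grpCore6=dummyhost1
group: grpCore8=dummyhost2,ibmclass
var: numOfCores=2
@grpCore6: numOfCores=6
@grpCore8: numOfCores=8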
__ 16. Synchronize the Hadoop configuration changes across your cluster. From a
command line type:
cd $BIGINSIGHTS_HOME/bin
./syncconf.sh hadoop
__ 17. In gedit, open mapred-site.xml that is in
File System->opt->ibm->biginsights->hadoop-conf. Note that the maximum
mapper tasks value is now 8 and the maximum reducer tasks value is now 4.
__ 18. Close mapred-site.xml.
__ 19. Again, open ibm-hadoop.properties that is in File
System->opt->ibm->biginsights->hdm->hadoop-conf-staging and move
ibmclass from grpCore8 to grpCore6.
group: grpCore6=dummyhost1,ibmclass
group: grpCore8=dummyhost2

__ 20. Save your work and close the file.


__ 21. Resynchronize your configuration files.
./syncconf.sh hadoop
__ 22. Open mapred-site.xml that is in
File System->opt->ibm->biginsights->hadoop-conf. The maximum mapper tasks
value is now back to 6 and the maximum reducer tasks value is now back to 3.
__ 23. Close mapred-site.xml.
__ 24. Finally, make use of the default value for numOfCores. Open ibm-hadoop.properties that is in File System->opt->ibm->biginsights->hdm->hadoop-conf-staging and remove ibmclass from both groups.
__ 25. Save your work and close the file.
__ 26. Resynchronize your configuration files.
./syncconf.sh hadoop
__ 27. Open mapred-site.xml that is in
File System->opt->ibm->biginsights->hadoop-conf. The maximum mapper tasks value is now 2 and the maximum reducer tasks value is now 1.
__ 28. Close your editor.

End of exercise


Exercise 4. Controlling Workloads with Oozie


Estimated time
0:45

What this exercise is about


In this exercise, the student will define Oozie workflows and use the
Oozie coordinator to schedule and control the execution of some
MapReduce applications.

What you should be able to do


At the end of this exercise, you should be able to:
- Code an Oozie workflow to invoke a MapReduce application
- Use the Oozie coordinator to schedule repeating occurrences
of a workflow
- Use the BigInsights workflow editor

Requirements
Requires the DW612 lab images.


Exercise instructions
Unfortunately, the first portion of this lab is going to be a typing exercise. The second part uses the BigInsights workflow editor to simplify things. The goal is to create a workflow to
invoke the Word Count MapReduce application that comes as a sample application with
BigInsights.
Then if things go well, you continue by creating a coordinator flow that invokes the created
workflow based upon a schedule.
Note
The full text of both the workflow and coordinator code appears at the end of this exercise.

Section 1: Start the BigInsights components


__ 1. Open a command line. (You can right-click on your desktop and select Open in
Terminal) Type the following. If prompted for a password, type ibm2blue.
cd $BIGINSIGHTS_HOME/bin
./start.sh hadoop derby oozie console

Section 2: Create an Oozie workflow


You are going to create the needed workflow in steps.
__ 1. Open a command line. (You can right-click on your desktop and select Open in
Terminal.)
__ 2. Create a directory in the biadmin home directory called exercise4.
mkdir ~/exercise4
__ 3. Change to the exercise4 directory and invoke some desired editor, for example gedit
or kwrite.
__ 4. Code the outer elements of an Oozie workflow. The application name is exercise4.
The first action to be started is wordcount. If there is a problem, the fail node gets invoked and a message is displayed. To terminate the flow without errors, the end node is invoked. After coding each step, it might make sense to save your
work. The name of the file must be workflow.xml.
<workflow-app name="Exercise4" xmlns="uri:oozie:workflow:0.1">
<start to="wordcount"/>
<kill name="fail">
<message>Bummer, error message[${wf:errorMessage()}]</message>
</kill>
<end name="end"/>
</workflow-app>

__ 5. After the <start to="wordcount"/> statement, code the first action node. Since you coded <start to="wordcount"/>, the name of this action is wordcount. It follows the start to... statement. This action node invokes a MapReduce application and upon
successful completion, it is to end normally. If this action node fails, it is to invoke the
kill node that you just coded.
<action name="wordcount">
<map-reduce>
</map-reduce>
<ok to="end"/>
<error to="fail"/>
</action>
__ 6. Within the <map-reduce> element, you need to specify your JobTracker and
NameNode. In this exercise, refer to your JobTracker and NameNode via variables.
(The values for the variables get specified in the job.properties file.) Since this Oozie
workflow could be scheduled to run multiple times and you most likely do not want to
have any manual intervention between runs, code a <prepare> element that deletes
your output directory. (Remember the MapReduce job fails, if the output directory
already exits.) You also need to code your configuration information. So following
<map-reduce>, code:
<job-tracker>${jobtracker}</job-tracker>
<name-node>${namenode}</name-node>
<prepare>
<delete path="${PREFIX}/${wf:user()}/myoutput"/>
</prepare>
<configuration>
</configuration>
__ 7. Within the <configuration> elements, code the properties that specify that you are
using the new MapReduce API, your MapReduce mapper class, reducer class,
output key and value type, input directory and output directory. (There are additional
properties that could be specified, like the number of map tasks, but we are not
going to worry about those now.)
Note
The wordcount application was written using the new MapReduce API so you need to
inform Oozie that you are using this new API.

<!-- These first two properties indicate the use of the new API -->
<property>
<name>mapred.mapper.new-api</name>
<value>true</value>
</property>

<property>
<name>mapred.reducer.new-api</name>
<value>true</value>
</property>
<property>
<name>mapreduce.map.class</name>
<value>org.apache.hadoop.examples.TokenizerMapper</value>
</property>
<property>
<name>mapreduce.reduce.class</name>
<value>org.apache.hadoop.examples.IntSumReducer</value>
</property>
<property>
<name>mapred.output.key.class</name>
<value>org.apache.hadoop.io.Text</value>
</property>
<property>
<name>mapred.output.value.class</name>
<value>org.apache.hadoop.io.IntWritable</value>
</property>
<property>
<name>mapred.input.dir</name>
<value>/user/${wf:user()}/GutenbergDocs</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>/user/${wf:user()}/myoutput</value>
</property>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
__ 8. Remember to save your work.
Note
If there were a desire to invoke another MapReduce application upon the successful
completion of the Word Count application, then instead of coding <ok to="end"/> for an
action node, you would code <ok to="anotheraction"/> and then code a second action node
with a name of anotheraction that invokes your other MapReduce application (a minimal
sketch appears below). But since you have done a good deal of typing so far, I will not
subject you to coding another action node.
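For reference only, a chained pair of actions would look roughly like the following. This
is just a sketch: the names anotheraction and the second application's configuration are
hypothetical and are not part of this exercise.
<action name="wordcount">
<map-reduce>
<!-- same job-tracker, name-node, prepare, and configuration as before -->
</map-reduce>
<ok to="anotheraction"/>
<error to="fail"/>
</action>
<action name="anotheraction">
<map-reduce>
<!-- job-tracker, name-node, and configuration for the second application -->
</map-reduce>
<ok to="end"/>
<error to="fail"/>
</action>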

__ 9. Next create a job.properties file in the ~/exercise4 directory. This is where you define
the various variables. Code the following in this file:
#JobTracker and NameNode
jobtracker=ibmclass:9001
namenode=ibmclass:9000
#prefix of the HDFS path for input and output, adapt!
PREFIX=hdfs://ibmclass:9000/user
#HDFS path where you need to copy workflow.xml and lib/*.jar to
oozie.wf.application.path=hdfs://ibmclass:9000/user/biadmin/exercise4/
#one of the values from Hadoop mapred.queue.names
queueName=default
__ 10. Save your work.
__ 11. From a command line, create a new directory under the ~/exercise4 directory called
lib.
mkdir ~/exercise4/lib
__ 12. Then copy the supplied WordCount.jar file to the ~/exercise4/lib directory.
cp /home/labfiles/DW61/examples/WordCount.jar ~/exercise4/lib
__ 13. Next from a command line, copy the exercise4 directory from the client to the
Hadoop cluster. (The job.properties file actually does not have to be moved into
Hadoop but the following put statement will copy the entire directory contents.)
hadoop fs -put ~/exercise4 exercise4
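If you want to confirm that everything landed in HDFS, a quick listing of the directory
and its lib subdirectory is a harmless sanity check (not a required step):
hadoop fs -ls exercise4
hadoop fs -ls exercise4/lib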
Note
If you have a problem with your workflow, remember that any changes that you make must
be copied to HDFS before they can be used.

__ 14. Next invoke your Oozie workflow from a command line.


cd $BIGINSIGHTS_HOME/oozie/bin
export OOZIE_URL=http://ibmclass:8280/oozie
./oozie job -run -config ~/exercise4/job.properties
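The submission prints a job ID, and the oozie client can use it to report status directly
from the command line. The ID shown here is only an example; yours will differ:
./oozie job -info 0000001-130822123456789-oozie-oozi-W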
__ 15. You can view the status of your workflow in one of two ways.
__ a. One way is the built-in JobTracker web interface. In Firefox, in a new tab, type a
URL of http://ibmclass:50030. Your job should appear under Running Jobs or
Completed Jobs.

__ b. The second approach is to use the BigInsights Web Console. In Firefox, type a
URL of http://ibmclass:8080. Log in and click the Application Status tab and
then the Workflows hyperlink.
__ 16. To see your results:
hadoop fs -cat myoutput/part-r-00000
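If you want to confirm which part files the reducer produced, you can also list the
output directory:
hadoop fs -ls myoutput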
Important
If your workflow fails, you can use the Web Console to drill down into the details about your
workflow for either your map or reduce tasks. Click the Attempt ID to view the system log.

Section 3: Working with an Oozie coordinator


In the interest of time, you will only code a simple coordinator.xml file in order to schedule a
workflow to run multiple times.
__ 17. From a command line, create a directory coord under the biadmin home directory.
mkdir ~/coord
__ 18. Using your editor of choice, create a new file, coordinator.xml, in this directory. You
will create the coordinator.xml in stages.
__ 19. The start time, end time, timezone and frequency values will be specified as
variables. The schedule name is to be Schedule_WordCount.
<coordinator-app name="Schedule_WordCount" frequency="${freq}"
start="${startTime}" end="${endTime}" timezone="${timezone}"
xmlns="uri:oozie:coordinator:0.1">
</coordinator-app>
__ 20. Between the <coordinator-app> elements specify a workflow that is to be controlled
by the coordinator. Use a variable to specify the path that points to the workflows
directory.
<action>
<workflow>
<app-path>${workflowPath}</app-path>
</workflow>
</action>
__ 21. Now as long as you define all of the variables in the coordinator.properties file, you
are good to go. In this case, the variables include not only those referenced in the
coordinator.xml file but also those in the workflow.xml that is to be run.

You can use <property> elements to pass parameters to the workflow.xml. Add a
property to set the workflow.xml variable, queueName, to the value default.
Properties are defined within <configuration> elements. Code the following after the
<app-path>...</app-path> element.
<configuration>
<property>
<name>queueName</name>
<value>default</value>
</property>
</configuration>
__ 22. Save your work.
__ 23. Create another file in the coord directory called coordinator.properties. (There is no
need for this file to actually be in this directory. But this is as good a place as any.)
__ 24. Code the following into this file. I will explain the file afterward. Change the
startTime value to some date in the future. Set the endTime to a month, or so,
later.
oozie.coord.application.path=hdfs://ibmclass:9000/user/biadmin/coord
freq=1440
startTime=2012-07-12T14:00Z (Set this to a date in the future)
endTime=2012-08-01T14:00Z (Set this to a date in the future)
timezone=UTC
workflowPath=hdfs://ibmclass:9000/user/biadmin/exercise4
jobtracker=ibmclass:9001
namenode=ibmclass:9000
PREFIX=hdfs://ibmclass:9000/user
__ 25. Save your work.
oozie.coord.application.path points to the directory in the Hadoop file system that has the
coordinator.xml file.
freq=1440 is one day expressed in minutes (coordinator frequencies are specified in minutes)
startTime=... is the time of the first execution of the workflow
endTime=... is the time that the schedule will end
timezone=UTC says to use what used to be referred to as Greenwich Mean Time.
workflowPath=... points to the directory in the Hadoop file system that has the
workflow.xml that is to be executed.
The jobtracker, namenode, and PREFIX variables are required by the workflow.xml.
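As a side note, Oozie also provides calendar-aware EL functions for the coordinator
frequency, which avoid counting minutes by hand. Whether your installed release supports
them is an assumption you should verify; if it does, the frequency attribute of the
coordinator-app element could instead be written as:
frequency="${coord:days(1)}"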

__ 26. Next from a command line, copy the coord directory from the client to the Hadoop
cluster. (The coordinator.properties file actually does not have to be moved into
Hadoop but the following put statement will copy the entire directory contents.)
__ 27. hadoop fs -put ~/coord coord
__ 28. Execute the Oozie coordinator.
cd $BIGINSIGHTS_HOME/oozie/bin
export OOZIE_URL=http://ibmclass:8280/oozie
./oozie job -run -config ~/coord/coordinator.properties
Important
If you code something incorrectly, then you will probably get a 500 Internal Server Error,
which seems to be a catch-all error message. For instance, when testing this exercise, I
typed config and not -config. This resulted in the 500 Internal Server Error. Go figure.

__ 29. If you use the Web Console, you can see the jobs that are currently scheduled.
Return to the Web Console.
__ 30. Click the Application Status tab. Then click the Scheduled Workflows hyperlink.
Your scheduled workflow should be displayed. Click it to get additional detail
information.

__ 31. Close the Web Console.

Use the BigInsights Workflow Editor


Section 4: Load an existing BigInsights project
To save time you are going to load an existing BigInsights project into Eclipse. This is
essentially the same MapReduce application that you wrote in the first part of this exercise.
You are going to use this project to see how to publish an application using BigInsights.
(Since you did not use a BigInsights project when you developed your application, you
cannot use the BigInsights development testing capabilities. That is why you are using
another project.)
__ 1. Open Eclipse using the icon on your desktop. Take the default workspace.
__ 2. Switch to the BigInsights perspective. Window->Open Perspective->Other. Select
BigInsights and click OK.
__ 3. In Eclipse, click File->Import.
__ 4. Expand General and select Existing Projects into Workspace. Click Next.
__ 5. Select the Select root directory radiobutton.
__ 6. Click the Browse pushbutton and drill down to
File System->Home->labfiles->DW61->MapReduce.
__ 7. Select LoadMaxTemp and click OK.
__ 8. Select the Copy projects into workspace checkbox.
__ 9. Click Finish.

Section 5: Publish your application


Publishing an application creates a workflow.xml file and allows for the application to
appear in the Manage list of applications in the Web Console. From there it can be
deployed so that it can be easily run by users with the proper authority.
__ 10. In Eclipse within the project list, right-click LoadMaxTemp and select Export.
__ 11. Expand BigInsights and select Application Publish. Click Next.
__ 12. If you have already associated a BigInsights server with your development
environment, it is already selected. If you had multiple servers defined, you would
need to select the appropriate one. If you do not have an associated server,
complete the following steps. Otherwise click Next.
------------------------------------------------------------------------------------------------------------------
__ a. Click the Create pushbutton.
__ b. Type in the URL of the BigInsights server. For this exercise it is
http://ibmclass:8080.
__ c. You can keep the default Server name.
__ d. User ID is biadmin.
__ e. Password is ibm2blue.
__ f. Select to save the password.

__ g. Test your connection. If the test is successful, click OK and then Finish.
__ h. If prompted to enter a password, enter ibm2blue.
__ i. Click No on the Secure Storage dialog.
__ j. Your new server should now be selected. Click Next.

---------------------------------------------------------------------------------------------------------------------
__ 13. Select Create New Application. An application name is pre-filled. Just go with that.
You can select an icon that is to be associated with your application. (We will not do
that now.)
__ 14. In Categories, type in Test. Then click Next.
__ 15. Select an Application Type of Workflow. Click Next.
__ 16. You are going to create a new single action workflow.xml file. The action type of
MapReduce is already selected. You can look at the other choices in the drop-down
box but make sure MapReduce is ultimately selected.
The process that you are going through creates an Oozie workflow. For a
MapReduce workflow, Oozie needs to know the names of the mapper and reducer
classes. You supply these by adding properties.
__ 17. Click the New pushbutton.
__ 18. From the Name drop down box select mapreduce.map.class.
__ 19. Specify a value of MaxTempMapper. (That was the name given to the mapper
class.) Click OK.
__ 20. Add a second property. Click the New pushbutton.
__ 21. From the Name drop down box select mapreduce.reduce.class.
__ 22. Specify a value of MaxTempReducer. Click OK.
The input and output directories were not hardcoded in this application. You need to
specify properties so that Oozie can pass that information to the application.
__ 23. Click the New pushbutton.
__ 24. From the Name drop down box select mapred.input.dir.
__ 25. Click OK.
__ 26. Click the New pushbutton.
__ 27. From the Name drop down box select mapred.output.dir.
__ 28. Click OK.
__ 29. Click Next.
__ 30. You are going to pass two parameters to your application. Because of the input and
output directory properties that you just specified, two parameters are already listed.
__ 31. Select the inputDir parameter and click Edit.

__ a. Display Name - Input Directory


__ b. Type - select Directory Path
__ c. Description - Either browse for the input directory or type one in.
__ d. Make sure that the Required checkbox is selected and click OK.
__ 32. Select the outputDir parameter and click Edit.
__ a. Display Name - Output Directory
__ b. Type - select Directory Path
__ c. Description - Type the output directory.
__ d. Make sure that the Required checkbox is selected and click OK.
__ 33. Click Next.
The JAR file for this application has not been created.
__ 34. Click the Create jar pushbutton.
__ 35. Click the Browse pushbutton.
__ 36. Select Browse for other folders.
__ 37. Drill down to biadmin->workspace->LoadMaxTemp
__ 38. Type a name of MaxMonthlyTemp.jar.
__ 39. Click OK.
__ 40. Click Finish.
Now add the newly created JAR file to the workflow.
__ 41. Click the Add pushbutton.
__ 42. Here you can specify any additional JAR files or libraries that need to be included.
On the right side, select MaxMonthlyTemp.jar and click OK.
__ 43. Click Finish.

Section 6: Deploy and run your application


__ 44. Return to the Web Console. (Open Firefox and specify a URL of
http://ibmclass:8080)
__ 45. Log in with a user name of biadmin and a password of ibm2blue.
__ 46. Select the Applications tab.
__ 47. Select the Manage hyperlink.
__ 48. On the left, expand Test. Listed is your LoadMaxTemp application. Select it.
__ 49. Click the Deploy pushbutton.
__ 50. Under Security, select biusergrp and click Deploy.
__ 51. Click the Run hyperlink and select the LoadMaxTemp icon.
You now are presented with a graphical interface which you can use to pass runtime
parameters to the Oozie workflow.
For this learning opportunity, there is no need for you to run the application; the
focus here is on the Oozie workflow that was created when you deployed the
application. You can view the generated workflow.xml by looking in HDFS in the
application directory that is under the user directory.
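A rough way to find it from the command line is shown below. The exact path is an
assumption (it depends on how your BigInsights release lays out deployed applications
under the user directory), so adjust the names to match what the first listing shows:
hadoop fs -ls /user/applications
hadoop fs -cat /user/applications/LoadMaxTemp/workflow/workflow.xml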

Section 7: Oozie Coordination editor


__ 52. Expand Schedule and Advanced Settings. (You may have to scroll down in the
parameter area.)
__ 53. Click the Schedule Job checkbox. Now you have the opportunity to graphically
create an Oozie coordinator.xml file.
__ 54. Click the View Schedule Configuration hyperlink. (You can just keep all of the
schedule default values.)
__ 55. Presented is the coordinator.xml file that will get generated.
__ 56. Click Close.

Section 8: Coding application dependencies setup


First you need to deploy a sample application that you will link together with your existing
LoadMaxTemp application in an Oozie workflow.
__ 57. Click the Manage hyperlink.
__ 58. Next on the left side, expand Import. Select Database Import and click the Deploy
pushbutton.
__ 59. Click the biusergrp checkbox and click Deploy.

Section 9: Create a linked application


__ 60. Click the Link hyperlink.
The goal here is to create a new application that executes the Database Import to load
data from a relational table into a directory in HDFS. After that completes, the
LoadMaxTemp application is run and uses the uploaded data as input.
__ 61. Select the Database Import icon on the left and drag it to the dashed-line box on the
right. The bar at the top of the icon will turn green when it is in the proper position.
(You might have to maximize your browser window to be able to see the dashed-line
box.)
__ 62. Do the same for the LoadMaxTemp icon as well.
__ 63. Type an Application Name of Linked Apps.

__ 64. Click the Next pushbutton.


The output directory of the Database Import application is to be the input directory for
the LoadMaxTemp application.
__ 65. Click the link icon for the Database Import output directory.
__ 66. Since there are only two applications in your link application, LoadMaxTemp is the
only application that can be chosen. Click the drop down box for Parameter. Select
Input Directory. Click OK.
You have just specified that the output directory for the Database Import application is
the input directory for the LoadMaxTemp application.
__ 67. Click Finish.

Section 10: Running a linked application


The process that you just worked through actually created what you can think of as a
new application. And like any published application, it must first be deployed before it
can be run.
__ 68. Click the Manage hyperlink.
__ 69. Now the question is, under which category is the new application listed? It gets listed
under all categories that were associated with all of the linked applications. So for
you, it would appear under Test, Import, and Database. Expand one of those and
select Linked Apps.
__ 70. Click the Deploy pushbutton.
__ 71. Select biusergrp and click Deploy.
__ 72. Click the Run hyperlink and then select the Linked Apps icon. You can see that the
parameters for this new application are a combination of the parameters of the two
applications that were linked, except for the input directory for the LoadMaxTemp
application since that parameter has already been resolved.
Do not try running this new application since there is not a relational database
management system on the image.
__ 73. Log out of the Web Console and close Firefox.
__ 74. Close Eclipse.

This ends this exercise.


Section 11: Workflow code
<workflow-app name="Exercise4" xmlns="uri:oozie:workflow:0.1">
<start to="wordcount"/>
<action name="wordcount">
<map-reduce>
<job-tracker>${jobtracker}</job-tracker>
<name-node>${namenode}</name-node>
<prepare>
<delete path="${PREFIX}/${wf:user()}/myoutput"/>
</prepare>
<configuration>
<property>
<name>mapred.mapper.new-api</name>
<value>true</value>
</property>
<property>
<name>mapred.reducer.new-api</name>
<value>true</value>
</property>
<property>
<name>mapreduce.map.class</name>
<value>org.apache.hadoop.examples.TokenizerMapper</value>
</property>
<property>
<name>mapreduce.reduce.class</name>
<value>org.apache.hadoop.examples.IntSumReducer</value>
</property>
<property>
<name>mapred.output.key.class</name>
<value>org.apache.hadoop.io.Text</value>
</property>
<property>
<name>mapred.output.value.class</name>
<value>org.apache.hadoop.io.IntWritable</value>
</property>
<property>
<name>mapred.input.dir</name>
<value>/user/${wf:user()}/GutenbergDocs</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>/user/${wf:user()}/myoutput</value>
</property>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
</map-reduce>

<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Bummer, error message[${wf:errorMessage()}]</message>
</kill>
<end name="end"/>
</workflow-app>

Section 12: Coordinator code


<coordinator-app name="Schedule_WordCount" frequency="${freq}"
start="${startTime}" end="${endTime}" timezone="${timezone}"
xmlns="uri:oozie:coordinator:0.1">
<action>
<workflow>
<app-path>${workflowPath}</app-path>
<configuration>
<property>
<name>queueName</name>
<value>default</value>
</property>
</configuration>
</workflow>
</action>
</coordinator-app>

End of exercise

Exercise 5. Configuring Flume for Data Loading


Estimated time
00:30

What this exercise is about


This exercise introduces how to configure Flume agents in order to
load data into Hadoop.

What you should be able to do


At the end of this exercise, you should be able to:
Configure Flume agents for data loading into Hadoop

Requirements
Requires the DW612 lab image.

Introduction
Flume is a distributed service for efficiently moving large amounts of data. The original
use case of Flume was to gather a set of log files on various machines in a cluster and
aggregate them to a centralized persistent store such as Hadoop Distributed File System
(HDFS). It has since been re-architected and expanded to cover a much wider variety of
data.
Probably the most important ingredient for this exercise is your imagination. You are going
to have to draw upon the inner child within you. Your exercises are running on a BigInsights
cluster that has a whopping one node. So it should be obvious that you are not going to be
able to move data from one node to another. For this exercise to be somewhat meaningful,
you are going to have to imagine that, when you start multiple agents, you have multiple
nodes and that Flume is running on each of those nodes.

Section 1: Creating a configuration file for basic testing


All of the information that is required by a Flume agent is acquired from a configuration file.
So to begin with, you are going to code up a simple configuration file. You are going to first
define a source and a sink that use some built-in Flume testing capabilities.
__ 1. If Hadoop is not running, open a command line. Right-click the desktop and select
Open in Terminal.
__ 2. Start Hadoop. If prompted for a password, enter ibm2blue.
cd $BIGINSIGHTS_HOME/bin
./start.sh hadoop
__ 3. Your Flume configuration file can reside anywhere as long as the agent can access
it. The convention is to place the configuration file in Flume's conf directory.
Start an editor, like gedit or kwrite.
Entries in the configuration file are prefixed with an agent's name. Assume that the first
agent with which you are going to work is to be named agent1. Also, since this is
possibly the first time that you have worked with Flume, you will initially make use of
some of the Flume testing capabilities.
The sequential generator source is an easy source to use since it creates the source
data for you. The logger is a good sink with which to play since it can display the results
in your console window.
Remember from the presentation material that a source and a sink are connected
together via a channel. You will use the memory channel for this exercise.
__ 4. Although the order in which the Flume elements are defined is immaterial, I am
going to present them in a way with which I am comfortable. I will first tell you
what is to be defined, followed by the actual statements. If you want to try your luck at
coding the configuration statements before seeing the answers, I suggest that you
cover the answers with a sheet of paper, code your own statements, and then do a
comparison.
Define your source, sink, and channel. Remember your agents name is agent1.
__ a. Source name is seqGenSource
__ b. Sink name is loggerSink
__ c. Channel name is memChannel
Code the following in your editor:
agent1.sources = seqGenSource
agent1.sinks = loggerSink
agent1.channels = memChannel
__ 5. Code the properties for seqGenSource
__ a. The source type is seq
Code the following in your editor:


agent1.sources.seqGenSource.type = seq
__ 6. Code the properties for loggerSink
__ a. The sink type is logger
Code the following in your editor:
agent1.sinks.loggerSink.type = logger
__ 7. Code the properties for memChannel
__ a. The channel type is memory
__ b. Its capacity is 100
Code the following in your editor:
agent1.channels.memChannel.type = memory
agent1.channels.memChannel.capacity = 100
__ 8. Connect your source to your defined channel.
agent1.sources.seqGenSource.channels = memChannel
__ 9. Connect your sink to your defined channel.
agent1.sinks.loggerSink.channel = memChannel
Important
Did you note that the binding definition for the source contains the keyword channels
(plural) and the binding definition for the sink contains the keyword channel? This is
because a source can write its events to multiple channels, whereas a sink can only drain a
single channel.
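Assembled in one place, the complete flume_agent1.properties file that you have just built
line by line should look like this:
agent1.sources = seqGenSource
agent1.sinks = loggerSink
agent1.channels = memChannel
agent1.sources.seqGenSource.type = seq
agent1.sinks.loggerSink.type = logger
agent1.channels.memChannel.type = memory
agent1.channels.memChannel.capacity = 100
agent1.sources.seqGenSource.channels = memChannel
agent1.sinks.loggerSink.channel = memChannel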

__ 10. Save your work into File System->opt->ibm->biginsights->flume->conf and call
the file flume_agent1.properties.

Section 2: Test your first agent


__ 11. From a command line, change to the flume directory.
cd $BIGINSIGHTS_HOME/flume
__ 12. Since this is just an exercise and exercises should never mirror real life, you are not
going to worry about specifying a configuration directory and setting environment
variables. Information about that was covered in the presentation material.

Note
To terminate your running agent, do a cntl-z in the console window. This is true both for
an agent that did not initialize properly due to a configuration error and for one that is
running just fine. The cntl-z does not terminate the Java process, however.
If you have a configuration error, do a cntl-z, correct your problem and restart your agent.
You do not have to worry about killing the process. The existing process is able to reload
the configuration file.

Now start your flume agent. Override the default logging information and write
informational records to the console. Your agent's name is agent1. Note: a lot of
data will be written to the console. cntl-z will terminate the output.
bin/flume-ng agent --name agent1 --conf-file conf/flume_agent1.properties
-Dflume.root.logger=INFO,console
Or
bin/flume-ng agent -n agent1 -f conf/flume_agent1.properties
-Dflume.root.logger=INFO,console
__ 13. Terminate the agent's Java process. First find the process id.
ps -ef | grep flume
__ 14. Then kill the process. (If kill does not work, do a kill -9.)
kill pid or kill -9 pid

Section 3: A more complicated configuration overview


You did not have to really use your imagination when working with the first flume agent. But
you will now. Here is the configuration that you are to implement.
Files are periodically dropped into a directory on system A. The data from each of those
files is to be read and turned into events. Each of those events is to be forwarded to system
B where the data is to be enhanced by adding timestamp information into the header for
each event. The enhanced events are then to be sent to system C where the events are to
be loaded into HDFS. The timestamp information in the event header is to be used to
define the directory names in which the events are to be stored.
In real life each of the three agents would run on separate systems and so you would have
three configuration files. But since your three agents are running on the same system, you
can use just a single configuration file.
To add some humor, at least for older people in the U.S.A. who remember the TV series
Get Smart, the names of the three agents are agent13, agent99, and agent86.
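Summarized as a data flow, the configuration you are about to build looks like this (the
ports and component names are the ones defined in the steps that follow):
files in /home/biadmin/sourcedata
  -> agent13: spoolDirSource -> memChannel -> avroSink (ibmclass:10013)
  -> agent99: avroSource (ibmclass:10013, timestamp interceptor) -> memChannel -> avroSink (ibmclass:10099)
  -> agent86: avroSource (ibmclass:10099) -> memChannel -> hdfsSink -> HDFS (flume/%y-%m-%d/%H%M)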

Section 4: Setup agent13


Note
I purposely chose to use the same channel name for all three agents. This is to show you
that the names only have to be unique within an agent.

agent13 is to do the initial read of the data from files dropped into a specified directory.
__ 1. First create your directory. From a command line
mkdir ~/sourcedata
__ 2. Open a new file in your text editor.
__ 3. Define the source, sink, and channel to be used by agent13 as well as the bindings.
__ a. The source name is spoolDirSource. Its type is spooldir, and its spoolDir property
points to /home/biadmin/sourcedata.
__ b. The sink name is avroSink. Its type is avro. (Remember that to pass events from
one agent to another requires avro sinks and sources.)
The hostname for binding is ibmclass and the port is 10013.
__ c. The channel name is memChannel. Its type is memory and it has a capacity of
100.
#These statements are for agent13
agent13.sources = spoolDirSource
agent13.sinks = avroSink
agent13.channels = memChannel
agent13.sources.spoolDirSource.type = spooldir
agent13.sources.spoolDirSource.spoolDir = /home/biadmin/sourcedata
agent13.sinks.avroSink.type = avro
agent13.sinks.avroSink.hostname = ibmclass
agent13.sinks.avroSink.port = 10013
agent13.channels.memChannel.type = memory
agent13.channels.memChannel.capacity = 100
agent13.sources.spoolDirSource.channels = memChannel
agent13.sinks.avroSink.channel = memChannel
__ 4. agent99 is to get its events from agent13. Since the avro sink for agent13 was
bound to ibmclass at port 10013, that implies that the avro source for agent99 will
also be bound to ibmclass at port 10013.
Also, you are going to enhance your events by adding a timestamp to the header for
each event.

Define the source, sink, and channel to be used by agent99 as well as the bindings.
__ a. The source name is avroSource. Its type is avro.
The bind parameter is ibmclass
The port is 10013
The interceptor name is ts
The interceptor type is timestamp
__ b. The sink name is avroSink. Its type is avro.
The hostname for binding is ibmclass and the port is 10099.
__ c. The channel name is memChannel. Its type is memory and it has a capacity of
100.
#These statements are for agent99
agent99.sources = avroSource
agent99.sinks = avroSink
agent99.channels = memChannel
agent99.sources.avroSource.type = avro
agent99.sources.avroSource.bind = ibmclass
agent99.sources.avroSource.port = 10013
agent99.sources.avroSource.interceptors = ts
agent99.sources.avroSource.interceptors.ts.type = timestamp
agent99.sinks.avroSink.type = avro
agent99.sinks.avroSink.hostname = ibmclass
agent99.sinks.avroSink.port = 10099
agent99.channels.memChannel.type = memory
agent99.channels.memChannel.capacity = 100
agent99.sources.avroSource.channels = memChannel
agent99.sinks.avroSink.channel = memChannel
__ 5. agent86 is to get its events from agent99. So there must be an avro source to
receive the data and the data is to be passed to an hdfs sink.
Define the source, sink, and channel to be used by agent86 as well as the bindings.
__ a. The source name is avroSource. Its type is avro.
The bind parameter is ibmclass
The port is 10099
__ b. The sink name is hdfsSink. Its type is hdfs.


A portion of the hdfs path is created by extracting date and time information from
the header of each event.
hdfs://ibmclass:9000/user/biadmin/flume/%y-%m-%d/%H%M
The filePrefix is log.
The writeFormat is Text.
The fileType is DataStream.
__ c. The channel name is memChannel. Its type is memory and it has a capacity of
100.
#These statements are for agent86
agent86.sources = avroSource
agent86.sinks = hdfsSink
agent86.channels = memChannel
agent86.sources.avroSource.type = avro
agent86.sources.avroSource.bind = ibmclass
agent86.sources.avroSource.port = 10099
agent86.sinks.hdfsSink.type = hdfs
agent86.sinks.hdfsSink.hdfs.path =
hdfs://ibmclass:9000/user/biadmin/flume/%y-%m-%d/%H%M
agent86.sinks.hdfsSink.hdfs.filePrefix = log
agent86.sinks.hdfsSink.hdfs.writeFormat = Text
agent86.sinks.hdfsSink.hdfs.fileType = DataStream
agent86.channels.memChannel.type = memory
agent86.channels.memChannel.capacity = 100
agent86.sources.avroSource.channels = memChannel
agent86.sinks.hdfsSink.channel = memChannel
__ 6. Save your work into File System->opt->ibm->biginsights->flume->conf and call
the file flume_agents.properties.

Section 5: Test your configuration


You are going to have to be working with a number of terminal windows. When you open a
terminal window, you can click Terminal->Set Title to change the title of your window.

agent13
__ 7. Open a terminal window. Change to the flume directory.
cd $BIGINSIGHTS_HOME/flume
__ 8. When you start agent13, even though you coded your configuration statements
correctly, you will see what looks like a Java exception when you start the agent.
That is because the avro sink is not able to connect to the source yet. Once agent99

starts and the avro source does its bind, you should see a statement something like
the following:
13/08/22 17:25:34 INFO sink.AvroSink: Avro sink avroSink: Building RpcClient
with hostname: ibmclass, port: 10013
Execute the following:
bin/flume-ng agent -n agent13 -f conf/flume_agents.properties
-Dflume.root.logger=INFO,console

agent99
__ 9. Open a terminal window. Change to the flume directory.
cd $BIGINSIGHTS_HOME/flume
__ 10. When you start agent99, even though you coded your configuration statements
correctly, you will see what looks like a Java exception when you start the agent.
That is because the avro sink is not able to connect to the source yet. Once agent86
starts and the avro source does its bind, you should see a statement something like
the following:
13/08/22 17:26:44 INFO sink.AvroSink: Avro sink avroSink: Building RpcClient
with hostname: ibmclass, port: 10099
Execute the following:
bin/flume-ng agent -n agent99 -f conf/flume_agents.properties
-Dflume.root.logger=INFO,console

agent86
__ 11. Open a terminal window. Change to the flume directory.
cd $BIGINSIGHTS_HOME/flume
__ 12. Start agent86. You should see a statement as follows:
13/08/22 17:26:42 INFO source.AvroSource: Avro source avroSource started.
Execute the following:
bin/flume-ng agent -n agent86 -f conf/flume_agents.properties
-Dflume.root.logger=INFO,console

Section 6: Transfer data


Assuming that all of your agents have properly started, you need to test the moving of data
from agent13 into HDFS.
__ 13. Open another terminal window.


__ 14. Change to your home directory.


cd ~
__ 15. Create a file with test data.
cat > test.txt
this is some data
to upload to hadoop
cntl-c
__ 16. Copy your file to the sourcedata directory.
cp test.txt sourcedata
__ 17. As soon as the file is added to the sourcedata directory, it gets processed. List the
contents of the sourcedata directory. You should see that the test.txt file has been
renamed to test.txt.COMPLETED.
ls sourcedata
__ 18. Next check to see that the data was moved to HDFS. Return to the terminal window
where you started agent86. You should see some statements indicating that a file
was created as a temporary file and then has been renamed. Notice that the
directory structure is made up of date and time information.
__ 19. View the contents of the newly created file. Return to the terminal window where you
created the test.txt file. Execute the following, replacing the file name with your file
name. (You can copy and paste the file name from the terminal window for
agent86.)
hadoop fs -cat flume/13-08-22/1727/log.1377206880008
__ 20. Execute cntl-z in each of the three windows where the agents are running in order to
terminate them.
__ 21. You can close your open terminal windows.
__ 22. Stop Hadoop
cd $BIGINSIGHTS_HOME/bin
./stop.sh hadoop

End of Exercise
