
Big Data 2016 Implementation Specialist Bootcamp

Exercises

Safe Harbor Statement


The following is intended to outline our general product direction. It is intended for information purposes only, and
may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality,
and should not be relied upon in making purchasing decisions.
The development, release, and timing of any features or functionality described for Oracle's products remains at
the sole discretion of Oracle.

Credits
This bootcamp is based largely on the Big Data Workshop created by Bruce Nelson, several Oracle By Example
tutorials available in the Big Data Learning Library and examples in My Oracle Support.

Oracle Training Materials Usage Agreement


Use of this Site ("Site") or Materials constitutes agreement with the following terms and conditions:
1. Oracle Corporation ("Oracle") is pleased to allow its business partner ("Partner") to download and copy the
information, documents, and the online training courses (collectively, "Materials") found on this Site. The use of
the Materials is restricted to the non-commercial, internal training of the Partner's employees only. The Materials
may not be used for training, promotion, or sales to customers or other partners or third parties.
2. All the Materials are trademarks of Oracle and are proprietary information of Oracle. Partner or other third party
at no time has any right to resell, redistribute or create derivative works from the Materials.
3. Oracle disclaims any warranties or representations as to the accuracy or completeness of any
Materials. Materials are provided "as is" without warranty of any kind, either express or implied, including without
limitation warranties of merchantability, fitness for a particular purpose, and non-infringement.
4. Under no circumstances shall Oracle or the Oracle Authorized Delivery Partner be liable for any loss, damage,
liability or expense incurred or suffered which is claimed to have resulted from use of this Site or Materials. As a
condition of use of the Materials, Partner agrees to indemnify Oracle from and against any and all actions, claims,
losses, damages, liabilities and expenses (including reasonable attorneys' fees) arising out of Partner's use of the
Materials.
5. Reference materials including but not limited to those identified in the Boot Camp manifest cannot be
redistributed in any format without Oracle's written consent.

Table of Contents

ABOUT THIS BOOTCAMP
  A. CHECKPOINT QUIZZES
  B. SCREEN SCRAPES
  C. WORKSHOP CONVENTIONS
  D. COMMAND CONVENTIONS

LAB A: CONNECT TO THE ENVIRONMENT
  TASK 1A: DOWNLOAD VM AND CONNECT TO THE ENVIRONMENT
  TASK 1B: COPYING EXERCISE FILES TO BIGDATALITE VM
  TASK 2A: CONNECT TO THE OPNEC ENVIRONMENT
  TASK 2B: CONNECT TO THE BIG DATA LITE VIRTUAL MACHINE
  TASK 2C: COPYING EXERCISE FILES IN OPNEC
  TASK 3: EXPLORING THE BIGDATALITE VM
  TASK 4: VIEWING FILES
  TASK 5: LAB FIXES FOR BIGDATALITE 4.2.1 VM
  LAB SUMMARY

LAB B: INTRODUCTION TO HADOOP
  TASK 1: COMPILE A WORD COUNT JOB IN JAVA
  TASK 2: COMPILE WORD COUNT CODE AND MOVE FILES INTO HDFS
  TASK 3: EXECUTE MAPREDUCE JOB
  LAB SUMMARY

LAB C: HIVE
  TASK 1: UPLOAD FILES INTO HDFS
  TASK 2: CREATE TABLES IN HIVE
  TASK 3: LOAD DATA INTO TABLES
  TASK 4: RUN QUERIES AND AGGREGATE DATA
  LAB SUMMARY

LAB D: APACHE PIG OVERVIEW
  TASK 1: LOAD CASH ACCOUNT DATA INTO HDFS
  TASK 2: RUN PIG SCRIPT TO AGGREGATE DATA
  TASK 3: STORE DATA INTO HDFS
  LAB SUMMARY

LAB E: ORACLE NOSQL
  TASK 1: INSERT AND RETRIEVE A SIMPLE KEY VALUE PAIR FROM THE NOSQL DATABASE
  TASK 2: RETRIEVE MULTIPLE VALUES AT THE SAME TIME
  TASK 3: USE ORACLE EXTERNAL TABLES TO READ DATA FROM NOSQL
  TASK 4: CREATE EXTERNAL TABLES IN ORACLE DB
  TASK 5: LOAD DATA INTO NOSQL
  TASK 6: AGGREGATE USING HADOOP MAPREDUCE AND STORE IN HDFS
  LAB SUMMARY

LAB F: ORACLE LOADER FOR HADOOP
  TASK 1: LOAD HDFS DATA INTO THE ORACLE DATABASE
  TASK 2: LOAD THE DATA FROM HDFS -> ORACLE DB
  TASK 3: LOAD THE DATA FROM HDFS -> ORACLE DB
  LAB SUMMARY

LAB G: ORACLE SQL CONNECTOR FOR HDFS
  TASK 1: CONFIGURE EXTERNAL TABLES STORED IN HDFS
  TASK 2: POINT EXTERNAL TABLE TO DATA IN THE HDFS FILE
  TASK 3: USE SQL TO READ DATA FROM MAPPED EXTERNAL TABLE
  LAB SUMMARY

LAB H: BIG DATA SQL
  TASK 1: UNDERSTAND THE BIG DATA SQL CONFIGURATION
  TASK 2: CREATE THE CORRESPONDING ORACLE DIRECTORY OBJECTS
  TASK 3: CREATE ORACLE TABLE OVER APPLICATION LOG IN HDFS
  TASK 4: REVIEW TABLES STORED IN HIVE
  TASK 5: USE ORACLE BIG DATA SQL TO CREATE AND QUERY HIVE TABLES
  LAB SUMMARY

APPENDIX
  START CLOUDERA SERVICE
  START CLOUDERA AGENT
  VARIOUS UTILITIES IN BIGDATALITE VM

About this Bootcamp


Objective: Learn about the conventions used in this workbook

! 5 minutes

A. Checkpoint Quizzes
There will be various checkpoint quizzes throughout the
workshop. This helps your workshop instructor gauge the
level of overall understanding in the workshop. This is also
being done to ensure that participants are not typing
commands without understanding the purpose behind the
command. If you cannot answer a question in the Checkpoint Quiz it may mean you should slow down
and read the material to get a more thorough understanding.

B. Screen Scrapes
[oracle@bigdatalite ~]$ ls
Desktop    exercises       Music            Public     Videos
Documents  GettingStarted  oradiag_oracle   scripts    WordCount.jar
Downloads  movie           Pictures         Templates

Screen scrapes of expected command results are also included in various places throughout the lab. Generally,
the results on your screen should match, but there may be occasions where there is a slight deviation. If you notice
a major deviation, please consult your workshop facilitator.

C. Workshop Conventions
Any text highlighted with this icon is very important. Failure to read may affect the results in your lab.

Best practices and important information are highlighted by this icon.

D. Command Conventions
[user@bigdatalite]$
    Denotes a command that should be run at the Linux command line as the user shown in the prompt.
SQL> @test.sql
SQL> select x from dual;
    Denotes a command that should be run from within sqlplus.
In general, commands will be denoted in BOLD. Please do not just look for the BOLD text and click
through; the purpose of this workshop is for you to leave with an understanding of the material. Read
through the material thoroughly.
Remember to read the scripts! There are no prizes for finishing first. Today's goal is to provide you with
hands-on experience with the technology. This is not a typing exercise.

Lab A: Connect to the Environment


Objective: Connect to OPNEC environment and login to the Big Data
Lite VM

! 20 minutes

Instructions:
Each boot camp participant will either be connecting to the OPNEC environment where multiple Big Data Lite
VMs are staged, or they will be downloading the Big Data Lite VM and running it locally on their laptop or a server
they have access to. Your instructor will let you know whether this is a hosted workshop (Task 2) or not (Task
1). If this is hosted in the OPNEC, please skip to Task 2a.
Task 1a: Download VM and Connect to the Environment
1. Go to the Big Data Lite Virtual Machine page and prepare to download the virtual machine. Currently,
the 4.2.1 version is supported for this lab.

Ensure you have a laptop or server with at least 8GB of real memory
2. Download all the zip files from OTN. You will need about 50GB of disk space to download and install.

3. Download and install 7zip
a. Windows users can get 7zip from www.7-zip.org
b. Mac users can use the p7zip binaries instead (see step 5c below)
4. Download AND install Oracle VirtualBox and the VirtualBox Extension Pack for your platform from
Oracle Technology Network. The VirtualBox Extension Pack is necessary for uploading the exercise
files into VirtualBox
a. http://www.oracle.com/technetwork/server-storage/virtualbox/downloads/index.html

5. After all zip files are downloaded, extract them with 7zip
a. In Linux, use the command below
i. 7za e BigDataLite-xxx.7z.001
b. In Windows, right click on the BigDataLite-xxx.7z.001 file and select 7-Zip -> Extract Here. Be
patient, the extract may take some time.
c. For Mac, download the p7zip binaries from softonic. Install the binaries and extract using the
Linux command above.
i. When extracting, make sure you use the 7za binary to do the extraction

The extraction will create the BigDataLite-xxx.ova appliance file. This file will be used to create a new
machine in VirtualBox. This single file contains the entire machine definition, including the physical disks as
well as defaults for the machine configuration.
6. Start VirtualBox Manager
7. Click File -> Import Appliance to launch the import wizard
8. Click Open Appliance
9. Locate the BigDataLite-xxx.ova file and click Open. Click Next and accept all defaults
10. Click Import
11. Double click on BigDataLite-xxx to start the VM. Log on as the oracle user to get started:
oracle/welcome1
Task 1b: Copying Exercise Files to BigDataLite VM
Before we begin the actual labs, the files needed for this workshop need to be copied to the VirtualBox
VM. Your instructor will provide you with the zipped exercises file either via jump drive or email.

Skip this step if you are using a Big Data Lite VM hosted in OPNEC
1. Go to Machine -> Settings in VirtualBox.

2. Click on Shared Folders

3. Click on the add button in the upper right corner. You will be adding a shared folder that
will give you access to the files in VirtualBox.
4. In the Add Share window, click on the down arrow next to Folder Path. Select Other.

5. Navigate to the directory where you saved the exercise.tar or exercise.zip file. Click Choose
6. The Add Share window will now be populated for you. Change the Folder name to stage.
7. Click on Make Permanent, click OK.
8. Verify that the new Shared Folder has been setup. Click OK.

9. You now need to mount this shared folder to your Big Data Lite VM. Open up a terminal window
in your Big Data Lite VM.
10. Switch to the root user (password: welcome1) and execute the rest of the commands to mount
this share
[oracle@bigdatalite ~]$ su
Password: welcome1
[root@bigdatalite ~]$ mkdir /vboxmount
[root@bigdatalite ~]$ mount -t vboxsf stage /vboxmount
[root@bigdatalite ~]$ df -kh
[root@bigdatalite ~]$ exit

11. At the end of the df -kh above, you should see a new stage filesystem mounted on /vboxmount
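If the stage entry is hard to spot in the df output, an optional narrower check (not part of the original lab
steps) is to filter the mount list; any line mentioning vboxsf and /vboxmount confirms the share is mounted.
[oracle@bigdatalite ~]$ mount | grep vboxsf    # optional check, not in the original lab; look for the stage share on /vboxmount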
12. As the oracle user now copy the files over and change the permissions.

Make sure that you have exited the root user and are now the oracle user.
[oracle@bigdatalite ~]$ cd /vboxmount
[oracle@bigdatalite ~]$ cp exercises.zip /home/oracle
[oracle@bigdatalite ~]$ cd /home/oracle
[oracle@bigdatalite ~]$ unzip exercises.zip

13. Proceed to Task 3

Task 2a: Connect to the OPNEC environment


Assumptions:

Your instructor has provided you with the IP address, your user number and VNC port number for the
hands on systems. Do not proceed without this information.

You have installed VNC viewer (or equivalent software) on your laptop or the provided machine.

Connect your laptop (or provided machine) to the network supplied by your instructor. Fill in the IP Address
information in the table below.
MACHINE NAME: ______________________________________
USER #: _________________________

VNC PORT #: ______________________________

VPN EMAIL: __________________________________ VPN PASSWORD: ___________________________


Account    Username    Password
VPN        user1       User___vnc
User       user1       User___ssh

Please do not proceed until you have filled in all the password information for this lab. Failure to login to the
correct lab may lead to issues for your fellow participants.

Task 2b: Connect to the Big Data Lite Virtual Machine


1. Open a VNC session to the virtual machine. To connect to the server, start the already installed VNC
client and connect to the IP address using the VNC port from the previous page (screen size 1024x768).
You may get an error about the transmission not being encrypted. Ignore this error and click Continue

Enter the user number you were assigned.

2. You are now connected to the virtual machine as your assigned user. Please make sure you are
using the number given by your instructor.
3. Enter the VNC password on the previous page. Replace the ### with your user number.

User###vnc

Do not enter ### as the password. Replace the ### with your assigned user # from your instructor.

4. Your screen may be locked. If it is, or if you are prompted for a password, enter the password for your user
account. Replace the ### with your user number.
User###ssh

Do not enter ### as the password. Replace the ### with your assigned user # from your instructor.

5. Go to Applications -> Accessories -> Terminal and start up a terminal session.


6. If VirtualBox is not already started, start up the VirtualBox server. Ignore any warnings
[oracle@bigdatalite] virtualbox &

7. If the BigDataLite VM is not running, press the start button to start it. The VM may take 5-10 minutes to
start up, please be patient.


8. Once you receive the window below, enter the oracle password (welcome1)
Task 2c: Copying Exercise Files in OPNEC
Before we begin the actual labs, the files needed for this workshop need to be copied from the
/BD_USER/profile/stage directory mounted on each VirtualBox VM.

Skip this step if you are using a Big Data Lite VM you downloaded to your own laptop or server.
1. Go to Machine -> Settings (Machine is located above the 2nd Applications menu).


Note: Remember we are working on a VM inside of VirtualBox.

2. Click on Shared Folders


3. If the folder already exists, skip to step 9.


4. Click on the add button in the upper right corner. You will be adding a shared folder that
will give you access to the files in VirtualBox.
5. In the Add Share window, click on the down arrow. Select Other.

6. Navigate to the /BD_USER1/profile/stage directory. Click Choose


7. The Add Share window will now be populated for you. Click on Make Permanent, click OK.

8. Verify that the new Shared Folder has been setup. Click OK.


9. You now need to mount this shared folder to your Big Data Lite VM. Switch to the root user
(password: welcome1) and execute the rest of the commands to mount this share. Make sure
you are oracle when you copy over the exercises
[oracle@bigdatalite ~]$ su
Password: welcome1
[root@bigdatalite ~]$ mkdir /vboxmount
[root@bigdatalite ~]$ mount -t vboxsf stage /vboxmount
[root@bigdatalite ~]$ df -kh
[root@bigdatalite ~]$ exit
[oracle@bigdatalite ~]$ ls /vboxmount
[oracle@bigdatalite ~]$ cp -r exercises /home/oracle


Task 3: Exploring the BigDataLite VM


For this workshop, we are using the Oracle Big Data Lite VM. This VM is available on the OTN site under the Big
Data Appliance section. Your instructor can provide you the exercises used in this workshop as well for use as
demos or with customers.
Oracle Big Data Lite Virtual Machine provides an integrated environment to help you get started with the Oracle
Big Data platform. Many Oracle Big Data platform components have been installed and configured - allowing you
to begin using the system right away. The following components are included on Oracle Big Data Lite:
Oracle Enterprise Linux 6.5
Oracle Database 12c Release 1 Enterprise Edition (12.1.0.2) - including Oracle Big Data SQL-enabled
external tables, Oracle Multitenant, Oracle Advanced Analytics, Oracle OLAP, Oracle Partitioning, Oracle
Spatial and Graph, and more.
Cloudera Distribution including Apache Hadoop (CDH5.3.0)
Cloudera Manager (5.3.0)
Oracle Big Data Connectors 4.1
o Oracle SQL Connector for HDFS 3.2.0
o Oracle Loader for Hadoop 3.3.0
o Oracle Data Integrator 12c
o Oracle R Advanced Analytics for Hadoop 2.4.1
o Oracle XQuery for Hadoop 4.1.0
Oracle NoSQL Database Enterprise Edition 12cR1 (3.2.5)
Oracle JDeveloper 12c (12.1.3)
Oracle SQL Developer and Data Modeler 4.0.3
Oracle Data Integrator 12cR1 (12.1.3)
Oracle GoldenGate 12c
Oracle R Distribution 3.1.1
Oracle Perfect Balance 2.3.0
Oracle CopyToBDA 1.1
1. Once you are logged in, note that there are a number of passwords used in this VM. Not all of the passwords
below are necessarily used in the lab.

Application          User      Password
Operating System     oracle    welcome1
Operating System     root      welcome1
HDFS                 hdfs      hdfs
Oracle Database      sys       welcome1
Oracle Database      BDA       welcome1
MySQL Database       root      welcome1
Cloudera Manager     admin     welcome1
RStudio              oracle    welcome1

2. The Big Data Lite VM desktop has everything you need to start accessing and learning about Big Data
concepts.


3. To access applications, pay particular attention to the bar at the top.

Firefox, Notes, Terminal, SQL Developer, ODI and JDeveloper can all be accessed from this top bar.
4. To start and stop the Big Data Lite Services use the Start/Stop Services icon on the desktop. Before you
begin your exercises, make sure ORCL, Zookeeper, HDFS, Hive and YARN are running.

5. Additional services (Yarn) may be towards the bottom of the list. To expand the list click on the (+) button
to advance as shown in the image below.

1. Can you download this VM and use it to demo Big Data for your customers or to run workshops of your
own? YES NO
2. Are all services automatically started? YES NO

Task 4: Viewing Files


1. You should see 3 files (the BigDataHOLExercises*.tar tarball and two sql scripts), and you should now be back
at the oracle user. We are going to now extract the tarball into the oracle home directory.
[oracle@bigdatalite ~]$ cd /home/oracle
[oracle@bigdatalite ~]$ ls
Desktop    exercises       Music            Public     Videos
Documents  GettingStarted  oradiag_oracle   scripts    WordCount.jar
Downloads  movie           Pictures         Templates

2. In each exercise folder there is a file called cleanup.sh. You can run this script to clean up the
results of the exercise in order to run it again if you need to.
[oracle@bigdatalite ~]$ cd exercises
[oracle@bigdatalite exercises]$ cd hive
[oracle@bigdatalite hive]$ ls
cheat.sh    export_credit_department.csv
cleanup.sh  export_mortgages_department.csv
[oracle@bigdatalite hive]$ gedit cheat.sh &
[1] 7100


Task 5: Lab Fixes for BigDataLite 4.2.1 VM


1. The 4.2.1 version of the Big Data Lite VM contains both JDK v7 and v8. To ensure MapReduce
works properly, enter the following commands below. You will be using the su command to
switch to the root user who owns this file.
[oracle@bigdatalite ~]$ su
welcome1
[root@bigdatalite ~]$ gedit /etc/default/bigtop-utils

2. Add the following line to the file: export JAVA_HOME=/usr/java/latest
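Optional sanity check (not part of the original lab): before rebooting, you can confirm that the edit was saved
by grepping the file; the export line you just added should appear in the output.
[root@bigdatalite ~]$ grep JAVA_HOME /etc/default/bigtop-utils    # optional check; the line export JAVA_HOME=/usr/java/latest should be listed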


3. Reboot the VirtualBox VM
4. Once the VM comes back up, make sure you are logged in as the oracle user. Run the setup
script to create the BDA user for the NoSQL lab and to apply some other fixes. Please be patient, this
script may take a while.
[oracle@bigdatalite ~]$ cd exercises/setup
[oracle@bigdatalite exercises]$ ./setup.sh


Lab Summary
The Oracle Big Data Lite Virtual Machine provides an integrated environment to help you get started with the
Oracle Big Data platform. Many Oracle Big Data platform components have been installed and configured, allowing you to begin using the system right away.
You should have successfully logged into your Big Data Lite VM, copied the exercises, made the appropriate
modifications and are now ready to begin working with Big Data.
Figure 1: Architecture of the Big Data Lite VM. Source data (CRM, Cash, Credit, Transactions, Insight) lands in HDFS and NoSQL, is processed with Pig and Hive, and is moved into the Oracle Database 12c (RDBMS/SQL) through the connectors OSCH, OLH, ODI, ORE and ORCH.

Do not proceed to the next section until you have clearance from your
instructor.


Lab B: Introduction to Hadoop


Objective: Upload files into HDFS and set up a Word Count
MapReduce job on a Hadoop Cluster

! 30 minutes

To get an understanding of what is involved in running a Hadoop job and all of the steps one must
undertake, we will set up and run a Hello World type exercise on our Hadoop cluster.
In this exercise you will:
1) Compile a Java Word Count written to run on a Hadoop Cluster
2) Upload the files on which to run the word count into HDFS
3) Run Word Count
4) View the Results

Task 1: Compile a Word Count job in Java


1. All of the setup and execution for the Word Count exercise can be done from the terminal, so to start out
this first exercise, please open the terminal by clicking on the Terminal icon in the menu.

2. To get into the folder where the scripts for the first exercise are, type in the terminal:
[oracle@bigdatalite hive]$ cd /home/oracle/exercises/wordCount

3. Let's look at the Java code that will run word count on a Hadoop cluster. Type in the terminal:
[oracle@bigdatalite wordCount]$ gedit WordCount.java
A new window will open with the Java code for word count. Take a look at lines 14 and 28 of the code. You
can see how the Mapper and Reducer interfaces are being implemented.
import org.apache.hadoop.io.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import java.io.IOException;
import java.lang.InterruptedException;
import java.util.StringTokenizer;

public class WordCount {

    public static class MyMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        Job job = new Job(conf, "WordCount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

4. When you are done evaluating the code you can click on the X in the right upper corner of the screen to
close the window.

1. In the main java function, which operation is run first, Mapper or Reducer? MAPPER REDUCER
2. Why?_______________________________________________________________________________
______________________________________________________________________________________


Task 2: Compile Word Count code and move files into HDFS
1. Let us see what is in the wordcount_classes directory. Type in the terminal:
[oracle@bigdatalite wordCount]$ ls wordcount_classes
Note that the directory is empty
2. We can now go ahead and compile the Word Count code. We need to run a command that will set the
correct classpath and output directory while compiling WordCount.java. Type in the terminal:

[oracle@bigdatalite wordCount]$ javac -cp /usr/lib/hadoop/*:/usr/lib/hadoop/client-0.20/* -d wordcount_classes WordCount.java
[oracle@bigdatalite wordCount]$ ls
cheat.sh    copyFiles.sh     file01           viewResults.sh
cleanup.sh  createJar.sh     file02           wordcount_classes
compile.sh  deleteOutput.sh  runWordCount.sh  WordCount.java
[oracle@bigdatalite wordCount]$ ls wordcount_classes

If you get an error, it is likely a typing mistake. Check and try again.
3. We can now create a jar file from the compile directory of Word Count. This jar file is required because the
word count code will be sent to all of the nodes in the cluster and run simultaneously on all nodes that have
appropriate data. To create the jar file, go to the terminal and type the line below.

There is a period at the end of the jar command below. Make sure you type the period after
wordcount_classes/
[oracle@bigdatalite wordCount]$ jar -cvf WordCount.jar -C wordcount_classes/ .
added manifest
adding: WordCount.class(in = 1604) (out= 845)(deflated 47%)
adding: WordCount$MyReducer.class(in = 1633) (out= 687)(deflated 57%)
adding: WordCount$MyMapper.class(in = 1722) (out= 753)(deflated 56%)
4. Let's proceed to view the files on which we will perform the word count. They contain customer comments
that will be the base of our sentiment analysis. To view the content of the two files, go to the terminal and
enter the command in bold.
Our goal is to see the opinions of the customers when it comes to the car insurance services
that our company has to offer.
[oracle@bigdatalite wordCount]$ cat file01 file02
very disappointed and very expensive
expensive and unreliable insurance
worthless insurance and expensive
worst customer service
worst insurance company
worst professional staff and unreliable insurance company
insurance is very expensive
disappointed with protocols
unreliable insurance
best service I recommend it
good professionals and efficient insurance
I will recommend it
good customer service
best insurance I found I recommend it
5. Now that we have the files ready we must move them into the Hadoop Distributed File System (HDFS). It
is important to note that files that are within HDFS are split into multiple chunks and stored on separate
nodes for parallel parsing. To upload our two files into the HDFS you need to use the copyFromLocal
command in Hadoop. Run the commands by typing at the terminal:
[oracle@bigdatalite wordCount]$ hadoop fs -ls /user/oracle/wordcount
[oracle@bigdatalite wordCount]$ hadoop fs -copyFromLocal file01
/user/oracle/wordcount/input/file01

6. We should now upload the second file. Go to the terminal and type:
[oracle@bigdatalite wordCount]$ hadoop fs -copyFromLocal file02
/user/oracle/wordcount/input/file02
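Optional check (not required by the lab): if you want to confirm that both files landed in HDFS before running
the job, list the input directory; file01 and file02 should both appear in the listing.
[oracle@bigdatalite wordCount]$ hadoop fs -ls /user/oracle/wordcount/input    # optional verification, not in the original lab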

Task 3: Execute MapReduce job


1. We can now run our Map/Reduce job to do a word count on the files we just uploaded using the files we
compiled in Task 1. Go to the terminal and type:
[oracle@bigdatalite wordCount]$ hadoop jar WordCount.jar WordCount
/user/oracle/wordcount/input /user/oracle/wordcount/output
15/07/15 03:10:19 INFO client.RMProxy: Connecting to ResourceManager at
localhost/127.0.0.1:8032
15/07/15 03:10:21 INFO input.FileInputFormat: Total input paths to process : 2
15/07/15 03:10:21 INFO mapreduce.JobSubmitter: number of splits:2
15/07/15 03:10:22 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_1436834454401_0001
15/07/15 03:10:23 INFO impl.YarnClientImpl: Submitted application
application_1436834454401_0001
15/07/15 03:10:23 INFO mapreduce.Job: The url to track the job:
http://bigdatalite.localdomain:8088/proxy/application_1436834454401_0001/
15/07/15 03:10:23 INFO mapreduce.Job: Running job: job_1436834454401_0001
....
15/07/15 03:10:57 INFO mapreduce.Job: Job job_1436834454401_0001 completed
successfully
15/07/15 03:10:57 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=1180
FILE: Number of bytes written=316285
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=942
HDFS: Number of bytes written=282
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=17486
Total time spent by all reduces in occupied slots (ms)=6324
Total time spent by all map tasks (ms)=17486
Total time spent by all reduce tasks (ms)=6324
Total vcore-seconds taken by all map tasks=17486
Total vcore-seconds taken by all reduce tasks=6324
Total megabyte-seconds taken by all map tasks=4476416
Total megabyte-seconds taken by all reduce tasks=1618944
Map-Reduce Framework
Map input records=22
Map output records=87
Map output bytes=1000
Map output materialized bytes=1186
Input split bytes=270
Combine input records=0
Combine output records=0
Reduce input groups=30
Reduce shuffle bytes=1186
Reduce input records=87
Reduce output records=30
Spilled Records=174
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=307
CPU time spent (ms)=2930
Physical memory (bytes) snapshot=724774912
Virtual memory (bytes) snapshot=2660659200
Total committed heap usage (bytes)=525336576
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=672
File Output Format Counters
Bytes Written=282
Notice that the map tasks execute first, then the reduce tasks.
A lot of text will roll by in the terminal window. This is informational data coming from the Hadoop
infrastructure to help track the status of the job. Wait for the job to finish; this is signaled by the command
prompt coming back.
2. Once you get the command prompt back, your Map/Reduce task is complete. It is now time to look at the
results. We can display the contents of the results file right from HDFS by using the cat command from
Hadoop. Go to the terminal and type the following command:
[oracle@bigdatalite wordCount]$ hadoop fs -cat
/user/oracle/wordcount/output/part-r-00000
I              4
and            7
awful          1
bank           1
best           2
company        2
cover          1
customer       4
disappointed   3
efficient      1
expensive      6
found          1
good           2
insurance      11
is             1
it             3
professional   1
professionals  1
protocols      1
recommend      3
service        8
staff          1
terrible       2
the            1
unreliable     3
very           3
will           1
with           2
worst          8
worthless      2

In the terminal the word count results are displayed. You will see the job performed a count of the number
of times each word appeared in the text files.
We can see that most of the customer comments are negative, as we encounter the word
worst 8 times, the word unreliable 3 times, the word expensive 6 times, the word
worthless 2 times and the word good only 2 times.
These are key-value pairs
It is important to note that this is a very crude sentiment analysis; the exercise was designed more to give
an introduction to MapReduce than to enable a comprehensive sentiment analysis. Most tools that do
sentiment analysis perform context-dependent word comparison where the entire sentence is taken into
consideration rather than each word in isolation. A Support Vector Machine could be implemented, or a
specialized tool such as Endeca could be used, to produce a much better designed sentiment analysis.
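If you would rather inspect the results with a local editor instead of the hadoop cat command, you could copy
the results file out of HDFS. This is optional, and the local file name below is only an example, not something
the rest of the lab relies on.
[oracle@bigdatalite wordCount]$ hadoop fs -copyToLocal /user/oracle/wordcount/output/part-r-00000 wordcount_results.txt    # optional; destination name is illustrative only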
3. As an experiment, let's try to run the Hadoop job again. Go to the terminal and type:
[oracle@bigdatalite wordCount]$ hadoop jar WordCount.jar WordCount
/user/oracle/wordcount/input /user/oracle/wordcount/output
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at WordCount.main(WordCount.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.j
ava:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
4. You will notice that an error message appears and no MapReduce task is executed. This is easily
explained by the immutability of data. Since Hadoop does not allow an update of data files (just read and
write), you cannot update the data in the results directory, hence the execution has nowhere to place the
output. For you to re-run the MapReduce job you must either point it to another output directory or clean
out the current output directory. Let's go ahead and clean out the previous output directory. Go to the
terminal and type:
[oracle@bigdatalite wordCount]$ hadoop fs -rm -r /user/oracle/wordcount/output
15/07/15 03:18:38 INFO fs.TrashPolicyDefault: Namenode trash configuration:
Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/oracle/wordcount/output
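As noted above, the other way around this error is to leave the existing output in place and point the job at a
fresh output directory instead. For example (the output2 directory name is only an illustration and is not used
elsewhere in this lab):
[oracle@bigdatalite wordCount]$ hadoop jar WordCount.jar WordCount /user/oracle/wordcount/input /user/oracle/wordcount/output2    # illustrative alternative; output2 is a hypothetical directory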

5. Now we have cleared the output directory and can re-run the MapReduce task. Let's just go ahead and
make sure it works again. Go to the terminal and type:
[oracle@bigdatalite wordCount]$ hadoop jar WordCount.jar WordCount
/user/oracle/wordcount/input /user/oracle/wordcount/output
15/07/15 03:19:53 INFO client.RMProxy: Connecting to ResourceManager at
localhost/127.0.0.1:8032
15/07/15 03:19:54 INFO input.FileInputFormat: Total input paths to process : 2
15/07/15 03:19:54 INFO mapreduce.JobSubmitter: number of splits:2
15/07/15 03:19:55 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_1436834454401_0002
15/07/15 03:19:55 INFO impl.YarnClientImpl: Submitted application
application_1436834454401_0002
15/07/15 03:19:55 INFO mapreduce.Job: The url to track the job:
http://bigdatalite.localdomain:8088/proxy/application_1436834454401_0002/
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=672
File Output Format Counters
Bytes Written=282
Now the Map Reduce job ran fine again as per the output on the screen.
This completes the word count example. You can now close the terminal window; go to the terminal
window and type:
[oracle@bigdatalite wordCount]$ exit


Lab Summary

Figure 2: Map Reduce Architecture

In this exercise you were able to see the basic steps required in setting up and running a very simple Map
Reduce Job. You saw what interfaces must be implemented when creating a Map/Reduce task, you saw
how to upload data into HDFS and how to run the map reduce task.
It is important to talk about execution time for the exercise, as the amount of time required to count a few
words is quite high in absolute terms. It is important to understand that Hadoop needs to start a separate
Java Virtual Machine to process each file or chunk of a file on each node of the cluster. As such, even a
trivial job has some processing time, which limits the possible applications of Hadoop: it is best suited to
batch jobs. Real-time applications where answers are required immediately can't be run on a Hadoop
cluster. At the same time, as the data volumes increase, processing time does not increase that much as
long as there are enough processing nodes.
A recent benchmark of a Hadoop cluster saw the complete sorting of 1 terabyte of data in just over 3
minutes on 910 nodes.

Do not proceed to the next section until you have clearance from your
instructor.


Lab C: Hive
Objective: Use HiveQL to create and insert data into tables and issue
queries

! 40 minutes

In this exercise you will use Hive Query Language to create tables, insert data into those tables and run queries
on that data. For this exercise we will use two data files that contain details related to customer credits and
mortgages. The purpose is to perform some aggregations on the data and store them into a Hive table. The
content of this table will then be loaded into the Oracle Database via Oracle Data Integrator in a later exercise.
In this exercise you will:
1. Upload files into HDFS
2. Create tables in Hive
3. Load the data into the Hive tables
4. Run queries on the Hive tables and create tables that will hold our final aggregated data
Task 1: Upload files into HDFS
1. All of the setup and execution for this exercise can be done from the terminal, hence open a terminal by
clicking on the Terminal icon on the desktop.

2. Make sure your hive service is running.


3. To get into the folder where the scripts for the Hive exercise are, type in the terminal:
[oracle@bigdatalite wordCount]$ cd /home/oracle/exercises/hive
4. Let's take a look at the first file that holds the credits of customers. Go to the terminal and type
[oracle@bigdatalite hive]$ head export_credit_department.csv
CU7854,1500,0
CU12993,2500,0
CU608,1500,0
CU10025,1500,0
CU11680,1500,0
CU10706,1000,0
CU6752,800,4315
CU6932,800,0
CU9555,1500,0
CU475,2000,0
The comma-delimited file contains the customer id, followed by the credit card limit and the credit
balance, for each customer.
Customer ID 7854 has a credit limit of 1500 and a balance of 0
5. Let's take a look at the mortgages file that holds the customer mortgages. Go to the terminal and type
[oracle@bigdatalite hive]$ head export_mortgages_department.csv
CU4155,4000
CU1005,1200
CU5403,1536
CU15326,3000
CU10154,1500
CU14434,0
CU6560,1100
CU12842,2144
CU6165,3500
CU11570,0
Customer ID 4155 has a mortgage of 4000.00

This file holds two columns, delimited by a comma. The first column is the customer id, and the second
column is the mortgage amount. In case a customer has two mortgages there will be two rows for the
same customer in the file. In case a customer has no mortgages the mortgage amount will be 0.
To reduce the errors with typing, feel free to open up cheat.sh using gedit. Don't forget to
go through each command to understand what is being done.
6. Let's proceed to copy the two files into HDFS. We will first copy the credits file into HDFS. Go to the
terminal and type:
[oracle@bigdatalite hive]$ hadoop fs -copyFromLocal
export_credit_department.csv /user/oracle/export_credit_department
7. We will now copy the mortgages file into HDFS. Go to the terminal and type:
[oracle@bigdatalite hive]$ hadoop fs -copyFromLocal
export_mortgages_department.csv /user/oracle/export_mortgages_department
Task 2: Create tables in Hive
1. Go to Start/Stop services. Ensure Hive is started
2. Let's now enter the Hive interactive shell environment to create tables and run queries against those
tables. To give an analogy, this is similar to SQL*Plus but it's specifically designed for the HiveQL
language. To enter the environment go to the terminal and type:
[oracle@bigdatalite wordCount]$ cd /home/oracle/exercises/hive
[oracle@bigdatalite hive]$ hive
15/07/15 03:27:49 WARN conf.HiveConf: DEPRECATED: Configuration property
hive.metastore.local no longer has any effect. Make sure to provide a valid
value for hive.metastore.uris if you are connecting to a remote metastore.
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.13.1-cdh5.3.0.jar!/hive-log4j.properties

3. The first thing we need to do in Hive is to create the tables that will hold our credit and mortgages data.
We will create a table named credit_department with three fields called customer_id, credit_card_limits
and credit_balance, something that looks very natural based on the data set. In the create table command
we will specify that the fields in the HDFS file are delimited by commas. Go to the terminal and type:
hive> create table credit_department (CUSTOMER_ID string, CREDIT_CARD_LIMITS
int, CREDIT_BALANCE int) ROW FORMAT delimited fields terminated by ',';
OK
Time taken: 3.357 seconds
Then press Enter

To reduce the errors with typing, feel free to open up cheat.sh using gedit. Don't forget to
go through each command to understand what is being done.

An OK should be printed on the screen indicating the success of the operation. This OK message will be
printed for all operations but we will only mention it this time. It is left up to the user to check for this
message for future HiveQL commands.
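If you want to double-check the table definition before loading any data, the describe command (which we
will use again later in this lab) can be run at any time; for the table just created, the output should look
something like the following.
hive> desc credit_department;
OK
customer_id          string
credit_card_limits   int
credit_balance       int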
4. We will now create the second table called mortgages_department, with two columns, the customer_id
and the mortgage_sum. Go to the terminal and type:
hive> create table mortgages_department (CUSTOMER_ID string, MORTGAGE_SUM
float) ROW FORMAT delimited fields terminated by ',';
OK
Time taken: 1.134 seconds
Then press Enter
5. We can now run a command to see all of the tables available to this OS user. Go to the Hive terminal and
type:
hive> show tables;
OK
credit_department
cust
mortgages_department
movie
movie_rating
movie_updates
movie_view
movieapp_log_avro
movieapp_log_json
movieapp_log_odistage
movieapp_log_stage
movielog
session_stats
user_movie
Time taken: 0.175 seconds, Fetched: 14 row(s)
The two previously created tables are visible in the output of the show tables command. There are several
other tables already created, but we can ignore those for this exercise.
Task 3: Load data into tables
1. Let's go ahead and load some data into the two tables. Data is loaded into Hive tables from flat files
available in the HDFS file system. In our case, these files were just copied into HDFS during the previous
steps of the exercise. We will first load data into the credit_department table. Go to the terminal and type:
hive> load data inpath '/user/oracle/export_credit_department' into table
credit_department;
Loading data to table default.credit_department
OK
Time taken: 1.192 seconds

Then press Enter



2. Next, we will load the data into the mortgages_department table. Go to the terminal and type:
hive> load data inpath '/user/oracle/export_mortgages_department' into table
mortgages_department;
Loading data to table default.mortgages_department
OK
Time taken: 0.461 seconds
Then press Enter
The data that we first copied into HDFS using the copyFromLocal command is now loaded
into the two Hive tables.
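A point worth knowing: load data inpath moves the files within HDFS into the Hive warehouse directory
rather than copying them. As an optional check (not part of the original lab), you can list the old location
from inside the Hive shell; the two export_* files should no longer appear there.
hive> dfs -ls /user/oracle/;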
6. We can now see the data that has been loaded into the credit_department table. Go to the terminal and
type:
hive> select * from credit_department limit 5;
OK
CU7854     1500    0
CU12993    2500    0
CU608      1500    0
CU10025    1500    0
CU11680    1500    0
Time taken: 0.68 seconds, Fetched: 5 row(s)
The first five rows in the credit_department table have been displayed in the terminal.
Task 4: Run queries and aggregate data
1. Next, we will perform an aggregation of the mortgage data to see, for each customer, how many
mortgages they have and what the total amount of their mortgages is. Go to the terminal and type the
following command. Be patient, it could take some time depending on your system's resources.
hive> select customer_id,count(*),sum(mortgage_sum) from mortgages_department
group by customer_id;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1436834454401_0004, Tracking URL =
http://bigdatalite.localdomain:8088/proxy/application_1436834454401_0004/
CU9820    1    1200.0
CU9823    1    0.0
CU9839    1    700.0
CU985     1    4000.0
CU9863    1    2500.0
CU9868    1    700.0
CU9897    1    1000.0
CU9913    1    1000.0
CU9932    1    0.0
CU9964    1    1100.0
CU998     1    2500.0
CU9988    1    600.0
Time taken: 36.174 seconds, Fetched: 1023 row(s)
Then press Enter
The customer data is displayed in the terminal. We can see the customer_id, followed by the number of
mortgages and the total amount for these mortgages.

When you issued a select in hive, how was the behavior different from a sql
query?______________________________________________________________________________
____________________________________________________________________________________
2. It is now useful to create a table with these results, using the SQL-like command CREATE TABLE ... AS
SELECT. Go to the terminal and type:
hive> create table mortgages_department_agg as select customer_id,count(*) as
n_mortgages,sum(mortgage_sum) as mortgage_amount from mortgages_department
group by customer_id;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1436834454401_0005, Tracking URL =
http://bigdatalite.localdomain:8088/proxy/application_1436834454401_0005/
. . .
Hadoop job information for Stage-1: number of mappers: 1; number of reducers:
1
2015-07-15 03:43:30,761 Stage-1 map = 0%, reduce = 0%
2015-07-15 03:43:39,745 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.16
sec
2015-07-15 03:43:49,406 Stage-1 map = 100%, reduce = 100%, Cumulative CPU
4.76 sec
MapReduce Total cumulative CPU time: 4 seconds 760 msec
Ended Job = job_1436834454401_0005
Moving data to: hdfs://bigdatalite.localdomain:8020/user/hive/warehouse/mortgages_department_agg
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.76 sec   HDFS Read: 12539  HDFS Write: 15771 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 760 msec
OK
Time taken: 33.326 seconds
Then press Enter
This creates a table that counts the number of mortgages per customer and totals the mortgage amounts.

3. We are now ready to create our final table, called consolidated_credit_products, by joining the credits
table with the aggregated mortgages table, again using a CREATE TABLE AS SELECT command.
Go to the terminal and type:
hive> create table consolidated_credit_products as select
hcd.customer_id, hcd.credit_card_limits, hcd.credit_balance, hmda.n_mortgages, hmda.mortgage_amount
from mortgages_department_agg hmda join credit_department hcd
on hcd.customer_id=hmda.customer_id;
Total jobs = 1
15/07/15 03:46:49 WARN conf.Configuration: file:/tmp/oracle/hive_2015-07-15_03-46-44_643_1193152908650882103-1/-local-10006/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
15/07/15 03:46:49 WARN conf.Configuration: file:/tmp/oracle/hive_2015-07-15_03-46-44_643_1193152908650882103-1/-local-10006/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
15/07/15 03:46:50 WARN conf.HiveConf: DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.
Execution log at: /tmp/oracle/oracle_20150715034646_93c48253-1fb9-415e-b4ed-bc37982adb4f.log
MapReduce Total cumulative CPU time: 2 seconds 540 msec
Ended Job = job_1436834454401_0006
Moving data to: hdfs://bigdatalite.localdomain:8020/user/hive/warehouse/consolidated_credit_products
MapReduce Jobs Launched:
Stage-Stage-3: Map: 1   Cumulative CPU: 2.54 sec   HDFS Read: 16014  HDFS Write: 22859 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 540 msec
OK
Time taken: 33.475 seconds
Then press Enter
4. As with normal SQL, you also have a describe command available to see the columns in the table. Go to
the terminal and type the command below. The structure of the table is displayed in the terminal, so that we
can see the column names and the data type for each column.
hive> desc consolidated_credit_products;
OK
customer_id          string
credit_card_limits   int
credit_balance       int
n_mortgages          bigint
mortgage_amount      double
Time taken: 0.275 seconds, Fetched: 5 row(s)

We will now proceed to display the data from the newly created Hive table. Go to the terminal and type:
hive> select * from consolidated_credit_products limit 5;
OK
CU0001    1000    0    1    -1000.0
CU0002    2000    0    1    -2000.0
CU0003    3000    0    1    -3000.0
CU0004    4000    0    1    -4000.0
CU0005    5000    0    1    -5000.0
Time taken: 0.172 seconds, Fetched: 5 row(s)
Then press Enter
The data is visible in Hive. The customer_id is followed by the credit limit and the credit balance. Then,
the number of mortgages is displayed, followed by the total amount of mortgages.
Note that customer 1 has a card limit of 1000.00, a credit balance of 0 and 1 mortgage that
is 1000.00
5. Our final exercise is to create a Hive view of the same query that we used to create
consolidated_credit_products. We will create a view named v_consolidated_credit_products. A Hive
view works similarly to a normal database view: it limits data movement and is dependent on its
parent tables.
hive> create view v_consolidated_credit_products as select
hcd.customer_id, hcd.credit_card_limits, hcd.credit_balance, hmda.n_mortgages, hmda.mortgage_amount
from mortgages_department_agg hmda join credit_department hcd
on hcd.customer_id=hmda.customer_id;
OK
Time taken: 0.316 seconds

6. As with the Hive table, the describe command is available to see the columns in the view. Go to the
terminal and type:
hive> desc v_consolidated_credit_products;
OK
customer_id          string
credit_card_limits   int
credit_balance       int
n_mortgages          bigint
mortgage_amount      double
Time taken: 0.227 seconds, Fetched: 5 row(s)

Why did describing the view not kick off a MapReduce job?
____________________________________________________________________________________
____________________________________________________________________________________
7. Execute the query:
hive> select * from v_consolidated_credit_products;
Total jobs = 1
15/07/15 03:53:13 WARN conf.Configuration: file:/tmp/oracle/hive_2015-07-15_03-53-09_215_459889610370842942-1/-local-10006/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
15/07/15 03:53:13 WARN conf.Configuration: file:/tmp/oracle/hive_2015-07-15_03-53-09_215_459889610370842942-1/-local-10006/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
15/07/15 03:53:14 WARN conf.HiveConf: DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.
Execution log at: /tmp/oracle/oracle_20150715035353_1a30b505-79e7-49b4-9835-8c07b5b90fba.log
2015-07-15 03:53:15    Starting to launch local task to process map join; maximum memory = 257949696
2015-07-15 03:53:16    Dump the side-table into file: file:/tmp/oracle/hive_2015-07-15_03-53-09_215_459889610370842942-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile21--.hashtable
2015-07-15 03:53:16    Uploaded 1 File to: file:/tmp/oracle/hive_2015-07-15_03-53-09_215_459889610370842942-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile21--.hashtable (38073 bytes)
2015-07-15 03:53:16    End of local task; Time Taken: 1.651 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1436834454401_0008, Tracking URL = http://bigdatalite.localdomain:8
CU9766    800     0        1    4850.0
CU9789    800     52123    1    6000.0
CU9790    800     0        1    3000.0
CU9791    800     0        1    900.0
CU9798    800     0        1    0.0
CU9820    800     0        1    1200.0
CU9823    1500    0        1    0.0
CU9839    800     0        1    700.0
CU985     500     0        1    4000.0
CU9863    800     0        1    2500.0
CU9868    1500    0        1    700.0
CU9897    2500    0        1    1000.0
CU9913    1000    0        1    1000.0
CU9932    1000    0        1    0.0
CU9964    700     0        1    1100.0
CU998     1500    0        1    2500.0
CU9988    1000    0        1    600.0
Time taken: 31.751 seconds, Fetched: 1023 row(s)
8. This exercise is now finished, so we will first exit Hive.
Go to the terminal and type:
hive> exit;
[oracle@bigdatalite hive]$
Then press Enter
9. We can now close the terminal. Go to the terminal and type:
[oracle@bigdatalite hive]$ exit
Then press Enter

Lab Summary

Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It is very useful for non-Java
programmers. Querying and analyzing large data sets is much easier for those with a database or SQL
background.
In this exercise you were able to do many of the same steps done in Lab B but instead of compiling java code,
you were able to execute similar MapReduce tasks using the Hive Query Language.

Do not proceed to the next section until you have clearance from your
instructor.


Lab D: Apache Pig Overview


Objective: Use Pig to aggregate data for sample customers

! 30 minutes

In this exercise we will be processing monthly cash accounts data for the customers in our system. We have a
comma delimited file with five columns: customer id, checking amount, bank funds, number of monthly checks
written and number of automatic payments performed. This data is retrieved from one of our internal systems on a
monthly basis and contains the monthly cash situation for each customer. The file that will be processed contains
data for three months, therefore we will see three rows for each customer, each row corresponding to one month.
Once we have loaded the data in Pig, we will proceed to aggregate it so that we obtain monthly average
values per customer, for each measured dimension. These values will then be imported in a later exercise into the
Oracle Database along with data from other systems, in order to create a 360-degree view of our customers.
In this exercise we will:
1) Load our monthly cash account data into our HDFS
2) Run a Pig script which aggregates the data to determine the average values of the accounts for each
customer
3) View the results and save them into an HDFS file that will be imported into the Oracle Database in a later
exercise.

Task 1: Load Cash Account data into HDFS


1. All of the setup and execution for this exercise can be done from the terminal, hence open the terminal by
clicking on the Terminal icon on the desktop.

2. To get into the folder where the scripts for this exercise are, type in the terminal:
[oracle@bigdatalite hive]$ cd /home/oracle/exercises/pig
[oracle@bigdatalite pig]$ ls
cheat.sh cleanup.sh export_monthly_cash_accounts.csv
Then press Enter
3. To get an idea of what our monthly cash accounts file looks like, let's look at the first couple of rows. In the
terminal type:
[oracle@bigdatalite pig]$ head export_monthly_cash_accounts.csv
CU8983,27.5,0,0,0
CU8983,25,0,0,0
CU8983,22.5,0,0,0
CU1655,176,501,0,668
CU1655,160,450.9,1,568
CU1655,144,551.1,2,768
CU13483,27.5,750,13,161
CU13483,25,675,14,61
CU13483,22.5,825,15,261
CU9863,27.5,0,0,0
Then press Enter

The first 10 rows of the data file will be displayed on the screen. The first column represents the customer
id, followed by the monthly checking amount, bank funds, number of checks written in a month and
number of automatic payments performed.
Customer ID 9863 has 27.50 in her checking account, 0 bank funds, has written 0 checks
and has no automatic payments coming out of her account.

4. Now that we have an idea of what our data file looks like, let's load it into HDFS for processing. To
load the data we use the copyFromLocal function of Hadoop. Go to the terminal and type:
[oracle@bigdatalite pig]$ hadoop fs -copyFromLocal
export_monthly_cash_accounts.csv
Then press Enter

What does the copyFromLocal command do?


____________________________________________________________________________________
____________________________________________________________________________________
Task 2: Run Pig Script to aggregate data
1. We will be running our PIG script in interactive mode so that we can see each step of the process. For
this we will need to open the PIG interpreter called grunt. To do this, go to the terminal and type:
[oracle@bigdatalite pig]$ pig
Then press Enter. Ignore any deprecation warnings.

To reduce the errors with typing, feel free to open up cheat.sh using gedit. Don't forget to
go through each command to understand what is being done.
2. Once we are in the grunt shell we can start typing Pig script. The first thing we need to do is load the
datafile from HDFS into Pig for processing. The data is not actually copied but a handler is created for
the file so Pig knows how to interpret the data. Go to the grunt shell and type:
grunt> monthly_cash_accounts = load 'export_monthly_cash_accounts.csv' using
PigStorage(',') as
(customer_id,checking_amount,bank_funds,monthly_checks_written,t_amount_autom_
payments);
Then Press Enter
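Optional aside (not part of the original lab steps): because this is only a load definition, nothing runs yet.
You can ask Pig for the schema it recorded with describe, which does not launch a MapReduce job; since
no types were given in the as clause, every field defaults to bytearray, so the output should look roughly
like this:
grunt> describe monthly_cash_accounts;
monthly_cash_accounts: {customer_id: bytearray,checking_amount: bytearray,bank_funds: bytearray,monthly_checks_written: bytearray,t_amount_autom_payments: bytearray}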


3. Now that the data is loaded as a five-column table, let's see what the data looks like in Pig. Go to the grunt
shell and type:
grunt> dump monthly_cash_accounts;
2015-07-15 04:04:01,915 [main] INFO org.apache.pig.tools.pigstats.ScriptState
- Pig features used in the script: UNKNOWN
2015-07-15 04:04:01,998 [main] INFO
org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, DuplicateForEachColumnRewrite,
GroupByConstParallelSetter, ImplicitSplitInserter, LimitOptimizer,
LoadTypeCastInserter, MergeFilter, MergeForEach, NewPartitionFilterOptimizer,
PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter,
StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]}
2015-07-15 04:04:02,170 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File
concatenation threshold: 100 optimistic? false
2015-07-15 04:04:02,277 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimiz
er - MR plan size before optimization: 1
2015-07-15 04:04:02,277 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimiz
er - MR plan size after optimization: 1
2015-07-15 04:04:02,404 [main] INFO org.apache.hadoop.yarn.client.RMProxy Connecting to ResourceManager at localhost/127.0.0.1:8032
2015-07-15 04:04:02,808 [main] INFO org.apache.pig.tools.pigstats.ScriptState
- Pig script settings are added to the job
2015-07-15 04:04:02,916 [main] INFO
org.apache.hadoop.conf.Configuration.deprecation mapred.job.reduce.markreset.buffer.percent is deprecated. Instead, use
mapreduce.reduce.markreset.buffer.percent
2015-07-15 04:04:02,916 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompile
r - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2015-07-15 04:04:02,916 [main] INFO
. . .
2015-07-15 04:04:09,430 [JobControl] INFO
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to
process : 1
2015-07-15 04:04:09,430 [JobControl] INFO
org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input
paths to process : 1
2015-07-15 04:04:09,462 [JobControl] INFO
org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input
paths (combined) to process : 1
(CU6234,695.2,9850,3,3005)
(CU6234,632,8865,4,2905)
(CU6234,568.8,10835,5,3105)
(CU12312,27.5,500,6,125)
(CU12312,25,450,7,25)
(CU12312,22.5,550,8,225)
(CU975,809.6,6600,1,2321)
(CU975,736,5940,2,2221)
(CU975,662.4,7260,3,2421)
(CU7799,27.5,500,14,196)
(CU7799,25,450,15,96)
(CU2753,27.5,1800,1,2086)
(CU2753,25,1620,2,1986)
(CU2753,22.5,1980,3,2186)
(CU11667,27.5,1050,2,211)
You will see output similar to the word count exercise on the screen. This is normal, as Pig is merely a
high-level language: every command that processes data simply runs MapReduce jobs in the background,
so the dump command becomes a MapReduce job. The output on the screen will show you all of the rows
of the file in tuple form.
Customer 11667 has 27.50 in his checking account, 1050 in total bank funds. He wrote 2
checks and 211.00 in automatic payments were drafted from his account.
4. The first step in analyzing the data will be grouping the data by customer id so that we have all of the
account details of one customer grouped together. Go to the grunt shell and type:
grunt> grouped= group monthly_cash_accounts by customer_id;
2015-07-15 04:05:46,124 [main] INFO
org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is
deprecated. Instead, use fs.defaultFS
Then press Enter
5. Let's go ahead and dump this grouped variable to the screen to see what its contents look like. Go to the
grunt shell and type:
grunt> dump grouped;
(CU15809,{(CU15809,25,0,0,0),(CU15809,27.5,0,0,0),(CU15809,22.5,0,0,0)})
(CU15821,{(CU15821,143.1,0,9,272),(CU15821,159,0,8,72),(CU15821,174.9,0,7,172)
})
(CU15828,{(CU15828,27.5,0,0,505),(CU15828,25,0,0,405),(CU15828,22.5,0,0,605)})
(CU15839,{(CU15839,22.5,0,0,0),(CU15839,27.5,0,0,0),(CU15839,25,0,0,0)})
(CU15852,{(CU15852,27.5,0,0,0),(CU15852,25,0,0,0),(CU15852,22.5,0,0,0)})
(CU15853,{(CU15853,795.3,500,5,1114),(CU15853,723,450,6,1014),(CU15853,650.7,5
50,7,1214)})
(CU15854,{(CU15854,25,585,15,88),(CU15854,22.5,715,16,288),(CU15854,27.5,650,1
4,188)})
(CU15866,{(CU15866,27.5,250,1,0),(CU15866,25,225,2,0),(CU15866,22.5,275,3,0)})
(CU15879,{(CU15879,22.5,0,0,0),(CU15879,25,0,0,0),(CU15879,27.5,0,0,0)})
(CU15886,{(CU15886,25,225,4,0),(CU15886,22.5,275,5,0),(CU15886,27.5,250,3,0)})
(CU15889,{(CU15889,22.5,0,0,0),(CU15889,27.5,0,0,0),(CU15889,25,0,0,0)})
(CU15927,{(CU15927,27.5,0,8,0),(CU15927,22.5,0,10,0),(CU15927,25,0,9,0)})
(CU15942,{(CU15942,22.5,275,0,0),(CU15942,25,225,0,0),(CU15942,27.5,250,0,0)})
(CU15957,{(CU15957,22.5,0,13,0),(CU15957,25,0,12,0),(CU15957,27.5,0,11,0)})
(CU15960,{(CU15960,22.5,0,2,0),(CU15960,27.5,0,0,0),(CU15960,25,0,1,0)})
(CU15979,{(CU15979,25,0,4,0),(CU15979,22.5,0,5,0),(CU15979,27.5,0,3,0)})
(CU15988,{(CU15988,27.5,1200,2,135),(CU15988,25,1080,3,35),(CU15988,22.5,1320,
4,235)})
Then press Enter
On the screen you will see all of the groups displayed in tuple-of-tuples form. As the output might look a
bit confusing, only a few tuples are shown below to aid clarity.
(CU15828,{(CU15828,27.5,0,0,505),(CU15828,25,0,0,405),(CU15828,22.5,0,0,605)})
(CU15879,{(CU15879,22.5,0,0,0),(CU15879,25,0,0,0),(CU15879,27.5,0,0,0)})
(CU15979,{(CU15979,25,0,4,0),(CU15979,22.5,0,5,0),(CU15979,27.5,0,3,0)})
Customer 15828 has 3 records in the grouping from step 4.
6. In the next step we will go through each group tuple and get the customer id and the average values for
the checking amount, bank funds, number of checks written monthly and total amount of automatic
payments made monthly. Go to the grunt shell and type:
grunt> average= foreach grouped generate group,
AVG(monthly_cash_accounts.checking_amount),AVG(monthly_cash_accounts.bank_fund
s),AVG(monthly_cash_accounts.monthly_checks_written),AVG(monthly_cash_accounts
.t_amount_autom_payments);
Then press Enter
7. Let's go ahead and see what this output looks like. Go to the grunt shell and type:
grunt> dump average;
2015-07-15 04:17:40,301 [main] INFO org.apache.pig.tools.pigstats.ScriptState
- Pig features used in the script: GROUP_BY
2015-07-15 04:17:40,375 [main] INFO
org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, DuplicateForEachColumnRewrite,
GroupByConstParallelSetter, ImplicitSplitInserter, LimitOptimizer,
LoadTypeCastInserter, MergeFilter, MergeForEach, NewPartitionFilterOptimizer,
PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter,
StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]}
2015-07-15 04:17:40,605 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File
concatenation threshold: 100 optimistic? false
(CU15821,159.0,0.0,8.0,172.0)
(CU15828,25.0,0.0,0.0,505.0)
(CU15839,25.0,0.0,0.0,0.0)
(CU15852,25.0,0.0,0.0,0.0)
(CU15853,723.0,500.0,6.0,1114.0)
(CU15854,25.0,650.0,15.0,188.0)
(CU15866,25.0,250.0,2.0,0.0)
(CU15879,25.0,0.0,0.0,0.0)
(CU15886,25.0,250.0,4.0,0.0)
(CU15889,25.0,0.0,0.0,0.0)
(CU15927,25.0,0.0,9.0,0.0)
(CU15942,25.0,250.0,0.0,0.0)
(CU15957,25.0,0.0,12.0,0.0)
(CU15960,25.0,0.0,1.0,0.0)
(CU15979,25.0,0.0,4.0,0.0)
(CU15988,25.0,1200.0,3.0,135.0)
Then press Enter
Now you can see on the screen a dump of all customer ids with their respective average account values.
A couple of them are highlighted below.
(CU15886,25.0,250.0,4.0,0.0)
(CU15854,25.0,650.0,15.0,188.0)
Customer 15886 has an average checking amount of 25.00. She has an average of 250 in
the bank, wrote an average of 4 checks and had no automatic payments.
8. Now that we have calculated the average monthly values for each customer id, it would be nice to have
them ordered from highest to lowest checking_amount. Let's get that list: go to the grunt shell and type:
grunt> sorted = order average by $1 DESC;
Then press Enter
2015-07-15 04:19:56,533 [main] INFO
org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is
deprecated. Instead, use fs.defaultFS
We can now see what the sorted list looks like. Go to the grunt terminal and type:
grunt> dump sorted;
2015-07-15 04:20:32,704 [main] INFO org.apache.pig.tools.pigstats.ScriptState
- Pig features used in the script: GROUP_BY,ORDER_BY
2015-07-15 04:20:32,706 [main] INFO
org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, DuplicateForEachColumnRewrite,
GroupByConstParallelSetter, ImplicitSplitInserter, LimitOptimizer,
LoadTypeCastInserter, MergeFilter, MergeForEach, NewPartitionFilterOptimizer,
PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter,
StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]}
(CU726,25.0,23000.0,1.0,30329.0)
(CU685,25.0,2401.0,4.0,3702.0)
(CU544,25.0,0.0,0.0,503.0)
(CU513,25.0,0.0,0.0,504.0)
(CU491,25.0,3686.0,1.0,958.0)
(CU475,25.0,0.0,2.0,504.0)
(CU396,25.0,9000.0,3.0,2680.0)
(CU339,25.0,17001.0,14.0,21264.0)
(CU293,25.0,3200.0,2.0,2826.0)
(CU225,25.0,3100.0,2.0,1046.0)
(CU216,25.0,2201.0,4.0,55686.0)
(CU197,25.0,0.0,0.0,598.0)
(CU161,25.0,6500.0,4.0,2220.0)
(CU153,25.0,0.0,0.0,504.0)
(CU141,25.0,3500.0,12.0,2975.0)
(CU129,25.0,0.0,1.0,713.0)
(CU27,25.0,0.0,0.0,506.0)
(CU2,25.0,0.0,0.0,542.0)
Then press Enter
On the screen you now see the list sorted in descending order. Displayed are the accounts with the
lowest checking amounts (the second column), but you can scroll up to see the rest of the values.
Task 3: Store data into HDFS
1. We now have the final results that we want. We will proceed to write these results out to HDFS, as we will
need them in a file that will be processed by the Oracle SQL Connector for HDFS (OSCH) in a later
exercise. The OSCH will take the data out of the HDFS file and load it into the Oracle Database, where
we want to create a 360 degree view of our customers. Go to the grunt shell and type:
grunt> store sorted into 'customer_averages' using PigStorage(',');
2015-07-15 04:24:37,109 [main] INFO org.apache.pig.tools.pigstats.ScriptState
- Pig features used in the script: GROUP_BY,ORDER_BY
2015-07-15 04:24:37,111 [main] INFO
org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, DuplicateForEachColumnRewrite,
GroupByConstParallelSetter, ImplicitSplitInserter, LimitOptimizer,
LoadTypeCastInserter, MergeFilter, MergeForEach, NewPartitionFilterOptimizer,
PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter,
StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]}
2015-07-15 04:24:37,115 [main] INFO
org.apache.hadoop.conf.Configuration.deprecation mapred.textoutputformat.separator is deprecated. Instead, use
mapreduce.output.textoutputformat.separator
. . .
2015-07-15 04:26:29,733 [main] INFO org.apache.hadoop.ipc.Client - Retrying
connect to server: bigdatalite.localdomain/127.0.0.1:12019. Already tried 0
time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3,
sleepTime=1000 MILLISECONDS)
2015-07-15 04:26:30,735 [main] INFO org.apache.hadoop.ipc.Client - Retrying
connect to server: bigdatalite.localdomain/127.0.0.1:12019. Already tried 1
time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3,
sleepTime=1000 MILLISECONDS)
2015-07-15 04:26:31,736 [main] INFO org.apache.hadoop.ipc.Client - Retrying
connect to server: bigdatalite.localdomain/127.0.0.1:12019. Already tried 2
time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3,
sleepTime=1000 MILLISECONDS)
2015-07-15 04:26:31,848 [main] INFO
org.apache.hadoop.mapred.ClientServiceDelegate - Application state is
completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2015-07-15 04:26:32,133 [main] INFO
org.apache.hadoop.mapred.ClientServiceDelegate - Application state is
completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2015-07-15 04:26:32,361 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- Success!
Then press Enter
2. The new calculated data is now permanently stored in HDFS. We can now exit the grunt shell. Go to the
grunt shell and type:
grunt> quit;
Then press Enter
3. Now back at the terminal, let's view some of the content of the HDFS file. Go to the terminal and type:
[oracle@bigdatalite pig]$ hadoop fs -cat /user/oracle/customer_averages/part-r-00000 | head
CU5632,23476.0,5036.0,9.0,24628.0
CU4125,23324.0,13602.0,2.0,24350.0
CU15635,23122.0,2100.0,2.0,23180.0
CU6503,23093.0,27000.0,2.0,29601.0
CU4871,21998.0,4100.0,15.0,22716.0
CU3167,21069.0,10525.0,1.0,98684.0
CU2320,20055.0,5200.0,10.0,20719.0
CU6758,19357.0,4040.0,0.0,20226.0
CU3229,18779.0,22700.0,4.0,23870.0
CU6117,18683.0,2250.0,2.0,20571.0
Then press Enter
This command simply ran cat on the results file stored in HDFS.
4. That concludes the Pig exercise. You can now close the terminal window. Go to the terminal and type:
[oracle@bigdatalite pig]$ exit
Then press Enter
Lab Summary

Pig is a high-level platform for creating MapReduce programs used with Hadoop. The language for this platform is
called Pig Latin. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation that makes
MapReduce programming high level, similar to that of SQL for RDBMS systems.
You may be asking yourself: when would I use Pig vs. Hive?
Pig Latin is procedural and fits very naturally in the pipeline paradigm. SQL, on the other hand, is declarative.
Consider, for example, a simple pipeline, where data from sources users and clicks is to be joined and filtered,
and then joined to data from a third source geoinfo and aggregated and finally stored into a table
ValuableClicksPerDMA.
In SQL this could be written as:
insert into ValuableClicksPerDMA
select dma, count(*) from geoinfo join
  ( select name, ipaddr
    from users join clicks on (users.name = clicks.user)
    where value > 0
  ) using (ipaddr)
group by dma;
The Pig Latin for this will look like:
Users = load 'users' as (name, age, ipaddr);
Clicks = load 'clicks' as (user, url, value);
ValuableClicks = filter Clicks by value > 0;
UserClicks = join Users by name, ValuableClicks by user;
Geoinfo = load 'geoinfo' as (ipaddr, dma);
UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';
Use of one versus the other is ultimately going to depend on the data and use case.
Do not proceed to the next section until you have clearance from your
instructor.
Lab E: Oracle NoSQL
Objective: Manage and maintain the NoSQL database
50 minutes
In this exercise you will be experimenting with the Oracle NoSQL Database. We need to process and retrieve our
banking transactions-related data from our production NoSQL database, in order to prepare it for import into the
Oracle Database. The import will be performed in a later exercise. Before we proceed to our actual use-case,
however, you will get to experiment with some simple examples, so that you can get an understanding of what
Oracle NoSQL Database is and how it can be used. Most of the exercises will have you look at pre-written Java
code, then compile and run that code. Ensure that you understand the code and all of its nuances as it is what
makes up the NoSQL Database interface. If you would like to understand all of the functions that are available,
see the Javadoc available on the Oracle Website here: http://docs.oracle.com/cd/NOSQL/html/javadoc/
In this exercise you will:
1. Insert and retrieve a simple key value pair from the NoSQL Database
2. Experiment with the storeIterator functionality to retrieve multiple values at the same time
3. Use Oracle Database external tables to read data from a NoSQL Database
4. Use AVRO schemas to store complex data inside the value field
5. Aggregate complex data using Map-Reduce and store the results in HDFS
Task 1: Insert and retrieve a simple key value pair from the NoSQL Database
1. All of the setup and execution for this exercise can be done from the terminal, hence open a terminal by
double clicking on the Terminal icon on the desktop.
2. To get into the folder where the scripts for the NoSQL exercise are, type in the terminal:
[oracle@bigdatalite ~]$ cd /home/oracle/exercises/noSQL
Then press Enter
3. Before we do anything with the NoSQL Database we must first start it. So let's go ahead and do that. Go
to the terminal and type:
[oracle@bigdatalite noSQL]$ java -jar $KVHOME/lib/kvstore.jar kvlite &
Then press Enter
Please wait until the following message appears in the terminal, before proceeding.
4. Once the message appears, press Enter to regain control of the terminal prompt.
5. To check if the database is up and running we can do a ping on the database. Go to the terminal and
type:
[oracle@bigdatalite noSQL]$ java -jar $KVHOME/lib/kvstore.jar ping -port 5000
-host `hostname`
Then press Enter
You will see Status: RUNNING displayed within the text. This shows the database is up and running.
Pinging components of store kvstore based upon topology sequence #14
Time: 2015-07-15 08:31:06 UTC
kvstore comprises 10 partitions and 1 Storage Nodes
Storage Node [sn1] on bigdatalite.localdomain:5000   Zone: [name=KVLite id=zn1 type=PRIMARY]
Status: RUNNING   Ver: 12cR1.3.2.5 2014-12-05 01:49:22 UTC  Build id: 7ab4544136f5
        Rep Node [rg1-rn1]   Status: RUNNING,MASTER at sequence number: 37 haPort: 5006
6. Before we start processing our transactional data that is needed to create the 360 degree view of the
customers, let's take some time to discover how the Oracle NoSQL Database can be used through some
simple Hello World-type exercises.
Oracle NoSQL Database uses a Java interface to interact with the data. This is a dedicated Java API
which lets you insert, update, delete and query data in the Key/Value store that is the NoSQL
Database. Let's look at a very simple example of Java code where we insert a Key/Value pair into the
database and then retrieve it.
Go to the terminal and type:
[oracle@bigdatalite noSQL]$ gedit Hello.java
Then press Enter
A new window will pop up with the code. In this code there are a couple of things to be noted. We see the
config variable that holds our connection string and the store variable that is our connection factory to the
database. These are the initialization variables for the Key/Value store and are highlighted in yellow. Next
we see the definition of two variables, of type Key and Value, which will serve as our payload to be inserted.
These are highlighted in green. Next we have, highlighted in purple, the actual insert command.
Highlighted in blue is the retrieve command for getting data out of the database.
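For reference, in case the screenshot is not visible in your copy of the manual, the put/get pattern described
above boils down to a minimal sketch along these lines (variable names are illustrative; the key and value
strings match the program output shown further down in this task):
import oracle.kv.*;

public class Hello
{
    public static void main(String[] args)
    {
        // Initialization: store name "kvstore", helper host localhost:5000 (the KVLite defaults)
        KVStoreConfig config = new KVStoreConfig("kvstore", "localhost:5000");
        KVStore store = KVStoreFactory.getStore(config);

        // The payload: a Key with the single major component "Hello"
        // and a Value wrapping the bytes of "Big Data World!"
        Key key = Key.createKey("Hello");
        Value value = Value.createValue("Big Data World!".getBytes());

        // Insert the Key/Value pair
        store.put(key, value);

        // Retrieve the pair again and print it
        ValueVersion vv = store.get(key);
        System.out.println("Key: " + key.toString());
        System.out.println("Value: " + new String(vv.getValue().getValue()));

        store.close();
    }
}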
7. When you are done evaluating the code click on the X in the right upper corner of the window to close it.
8. Let's go ahead and compile that code. Go to the terminal and type:
[oracle@bigdatalite noSQL]$ javac Hello.java
Then press Enter
9. Now that the code is compiled, let's run it. Go to the terminal and type:
[oracle@bigdatalite noSQL]$ java Hello
Key: /Hello
Value: Big Data World!
Then press Enter
You will see the Key we inserted in the Database (Hello), together with the Value (Big Data World!)
printed on the screen.
Task 2: Retrieve multiple values at the same time
1. Oracle NoSQL Database has the possibility of having a major and a minor component to the Key. This
feature can be very useful when trying to group and retrieve multiple items at the same time from the
database. Additionally, the major component of the Key may be comprised of more than one item. In the
next code we have 2 major components to the Key, each major component being represented by a pair of
items ((Participant,Mike) and (Participant,Dave)); each major component has a minor component
(Question and Answer). We will insert a Value for each Key and we will use the storeIterator function to
retrieve all of the Values regardless of the minor component of the Key for Mike and completely ignore
Dave. Lets see what that code looks like.
Go to the terminal and type:
[oracle@bigdatalite noSQL]$ gedit Keys.java
Then press Enter
import java.io.File;
import java.io.IOException;
import java.util.*;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import oracle.kv.*;

public class Keys
{
    public static void main(String[] args)
    {
        KVStoreConfig config = new KVStoreConfig("kvstore", "localhost:5000");
        KVStore store = KVStoreFactory.getStore(config);
        String questionMike = "This is a question from Mike";
        . . .
        store.put(MikeKey1,MikeQuestion);
        store.put(MikeKey2,MikeAnswer);
        store.put(DaveKey1,DaveQuestion);
        store.put(DaveKey2,DaveAnswer);
        Iterator<KeyValueVersion> MyStoreIterator =
            store.storeIterator(Direction.UNORDERED, 0, Key.createKey("Participant"),
                                new KeyRange("Mike"), null);
        while (MyStoreIterator.hasNext())
        {
            KeyValueVersion entry = MyStoreIterator.next();
            Key k = entry.getKey();
            Value v = entry.getValue();
            System.out.println(k.toString() + " - " + new String(v.getValue()));
        }
A new window will pop up with the code. If you scroll to the bottom you will notice the code shown in the
excerpt above. Highlighted in purple are the insertion calls that add the data to the database. The
retrieval of multiple records is highlighted in blue, and green shows the display of the retrieved data.
Do note that four Key/Value pairs were inserted into the database.
When you are done evaluating the code click on the X in the right upper corner of the window to close it.
2. Let's go ahead and compile that code. Go to the terminal and type:
[oracle@bigdatalite noSQL]$ javac Keys.java
Now that the code is compiled, let's run it. Go to the terminal and type:
[oracle@bigdatalite noSQL]$ java Keys
/Participant/Mike/-/Answer - This is an answer from Mike
/Participant/Mike/-/Question - This is a question from Mike
Then press Enter
You will see the 2 values that are stored under the (Participant, Mike) major key displayed on the screen,
and no values for the (Participant, Dave) major key. Major and minor parts of the key can be composed of
multiple strings and further filtering can be done. This is left up to the participants to experiment with.
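As a starting point for that experimentation, the following minimal sketch builds a Key with a two-item major
component and a minor component and writes a Value under it. The class and variable names here are
illustrative, not taken from Keys.java:
import java.util.Arrays;
import oracle.kv.*;

public class KeyExperiment
{
    public static void main(String[] args)
    {
        KVStoreConfig config = new KVStoreConfig("kvstore", "localhost:5000");
        KVStore store = KVStoreFactory.getStore(config);

        // Major component ("Participant", "Mike"), minor component ("Question")
        Key key = Key.createKey(Arrays.asList("Participant", "Mike"),
                                Arrays.asList("Question"));
        store.put(key, Value.createValue("This is a question from Mike".getBytes()));

        // Restricting a storeIterator to the "Mike" subtree, as Keys.java does, is achieved
        // by passing a parent Key of /Participant together with a KeyRange of "Mike"
        store.close();
    }
}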
Task 3: Use the KV command line interface to view and edit data
1. We have now added some keys and data to our NoSQL DB. The Oracle NoSQL DB comes with a
command-line utility to view and edit data in the DB. Run the command:
[oracle@bigdatalite noSQL]$ java -jar $KVHOME/lib/kvcli.jar -host localhost
-port 5000 -store kvstore
2. Enter help at the kv-> prompt to see the available commands
3. Enter get kv -key /Hello at the kv-> prompt to display the first key we entered in this exercise.
kv-> get kv -key /Hello
Big Data World!
We entered this key when we ran Hello.java in Task 1
4. Enter put kv -key /Hello -value "goodbye class" at the kv-> prompt, then enter get kv
-key /Hello at the kv-> prompt. You have just successfully updated a value in the NoSQL DB.
kv-> put kv -key /Hello -value "goodbye class"


Operation successful, record updated.
kv-> get kv -key /Hello
goodbye class
5. Enter exit at the kv-> prompt.
kv-> exit
Task 4: Create external tables in Oracle DB
1. We will now proceed to show how to create external tables in the Oracle Database, pointing to data
stored in a NoSQL Database. We will create the Oracle external table that will allow Oracle users to query
the NoSQL data. For this exercise we will use the NoSQL data that we just inserted and create a table
with three columns: the name (Mike or Dave), the type (Question or Answer) and the text (This is
a(n) question/answer from Mike/Dave). This table will contain all the NoSQL records for which the first
item in the Key's major component is Participant; basically, it will contain all four records that were
inserted.
First, make sure the database is running. Second, let's review the command used to create the external
table. Go to the terminal and type:
[oracle@bigdatalite noSQL]$ gedit createTable.sh
Then press Enter
First we create an operating system directory (highlighted in yellow). Then, we create the necessary
Oracle Database directory objects and grant appropriate privileges to the BDA user, which will hold our
external table (highlighted in purple). Finally, we issue the command that creates the external table called
EXTERNAL_NOSQL (highlighted in green). Please notice the type (oracle_loader) and preprocessor
used in the CREATE TABLE command; a specific NoSQL preprocessor is used.
We just created an external table that reads data from the file: nosql.dat
2. When you are done evaluating the script, click on the X in the right upper corner of the window to close it.
3. We are now ready to run the script that creates the external table. Go to the terminal and type:
[oracle@bigdatalite noSQL]$ ./createTable.sh
ORACLE_SID = [orcl] ? The Oracle base remains unchanged with value
/u01/app/oracle
SQL*Plus: Release 12.1.0.2.0 Production on Fri Jul 31 17:56:10 2015
Copyright (c) 1982, 2014, Oracle.

All rights reserved.

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing
options
SQL>
Directory created.
SQL>
Directory created.
SQL>
Grant succeeded.
SQL>
Grant succeeded.
SQL>
Grant succeeded.
SQL> Disconnected from Oracle Database 12c Enterprise Edition Release
12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing
options
SQL*Plus: Release 12.1.0.2.0 Production on Fri Jul 31 17:56:10 2015
Copyright (c) 1982, 2014, Oracle.

All rights reserved.

Last Successful login time: Fri Jul 31 2015 17:50:27 -04:00


Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing
options
SQL> SQL>
2
Table created.

SQL> Disconnected from Oracle Database 12c Enterprise Edition Release


12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing
options
[oracle@bigdatalite noSQL]$
Then press Enter
4. For this particular exercise we also need a Java Formatter class, used to convert the K/V pairs
from the NoSQL record format to the external table format. Let's go ahead and review that Java code.
Go to the terminal and type:
[oracle@bigdatalite noSQL]$ gedit externaltables/MyFormatter.java
Then press Enter
Highlighted in yellow are the commands used to get the second item of the Key's major component
(majorPath.get(1)) into the userName variable, the Key's minor component into the type variable and the
Value into the text variable. The concatenation of these three variables, separated by the pipe sign (|), is
highlighted in purple.
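If the screenshot is not available, the core of such a formatter looks roughly like the sketch below. It assumes
the oracle.kv.exttab.Formatter interface with a toOracleLoaderRecord(KeyValueVersion, KVStore) method,
which is the contract the NoSQL external table preprocessor expects; the MyFormatter.java file on the VM is
authoritative for this lab:
package externaltables;

import oracle.kv.KVStore;
import oracle.kv.Key;
import oracle.kv.KeyValueVersion;

// Rough sketch only: the interface and method name are assumed from the Oracle NoSQL
// external tables documentation; check the lab's MyFormatter.java for the exact code.
public class MyFormatter implements oracle.kv.exttab.Formatter
{
    public String toOracleLoaderRecord(KeyValueVersion kvv, KVStore kvStore)
    {
        Key key = kvv.getKey();
        String userName = key.getMajorPath().get(1);          // e.g. Mike or Dave
        String type = key.getMinorPath().get(0);              // Question or Answer
        String text = new String(kvv.getValue().getValue());  // the stored text

        // Pipe-separated record, matching the external table's field definition
        return userName + "|" + type + "|" + text;
    }
}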
5. When you are done evaluating the code, click on the X in the right upper corner of the window to close it.
6. Let's go ahead and compile that code. Go to the terminal and type:
[oracle@bigdatalite noSQL]$ javac externaltables/MyFormatter.java
Then press Enter
7. Now that the code is compiled, lets place it in a JAR file. Go to the terminal and type:
[oracle@bigdatalite noSQL]$ jar -cf externaltables.jar externaltables/*.class
[oracle@bigdatalite noSQL]$ ls
addTransSchema.sh  cheat.sh  cleanup.sh  createTable.sh  dropTable.sh
export_transactions.csv  external_noSQL_config.xml  externaltables  externaltables.jar
exttab  Hadoop.java  Hello.class  Hello.java  Keys.class  Keys.java
kvcli.sh  kvroot  NoSQLTransactions.java  pingNoSQL.sh  publish.sh
startNoSQL.sh  stopNoSQL.sh  trans.avsc  viewData.sh
Then press Enter


8. To connect the external table to the NoSQL Database, we will have to run a publish command. This
command makes use of an XML configuration file. Let's review that file first. Go to the terminal and type:
[oracle@bigdatalite noSQL]$ gedit external_noSQL_config.xml
Then press Enter


This file contains the name of the Oracle external table (EXTERNAL_NOSQL), the Parent Key of the
NoSQL records (the first item in the Key's major component) and the name of the formatter class used to
format the NoSQL records.

9. When you're done reviewing the file click the X in the top right corner of the window to close it.
10. Let's now run the publish command that connects the Oracle external table to the NoSQL data. Go to the
terminal and type:
[oracle@bigdatalite noSQL]$ java -classpath
$KVHOME/lib/*:/home/oracle/exercises/noSQL/externaltables:$ORACLE_HOME/jdbc/li
b/ojdbc6.jar oracle.kv.exttab.Publish -config external_noSQL_config.xml publish
[Enter Database Password:]


Publish was successful.
11. You will be asked to enter a password for the code to be able to login to the database user. Enter the
following information
[Enter Database Password:]: welcome1
Then Press Enter
NOTE: No text will appear while you type
We will now be able to query the external table. We have a script that does just that. Let's review it first,
so go to the terminal and type:
[oracle@bigdatalite noSQL]$ gedit viewData.sh
Then press Enter
The script performs a simple select on the BDA.EXTERNAL_NOSQL table that we've just created.
12. When you're done reviewing the script click the X in the top right corner of the window to close it.
13. Lets run the script that queries the table. Go to the terminal and type:
[oracle@bigdatalite noSQL]$ ./viewData.sh
ORACLE_SID = [orcl] ? The Oracle base remains unchanged with value
/u01/app/oracle
SQL*Plus: Release 12.1.0.2.0 Production on Fri Jul 31 18:07:15 2015
Copyright (c) 1982, 2014, Oracle.

All rights reserved.

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing
options
SQL> SQL>
NAME                           TYPE                 TEXT
------------------------------ -------------------- ---------------------------------------
Mike                           Answer               This is an answer from Mike
Mike                           Question             This is a question from Mike
Dave                           Answer               This is an answer from Dave
Dave                           Question             This is a question from Dave
SQL> Disconnected from Oracle Database 12c Enterprise Edition Release


12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing
options
Then press Enter
The records generated by the Java program that was run a few steps ago to insert some data into the
NoSQL Database are visible from the Oracle database.
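Because EXTERNAL_NOSQL is an ordinary (external) table in the BDA schema, it can also be queried from any
JDBC client, not only from SQL*Plus. A minimal sketch is shown below; the connection URL assumes the VM's
default listener and service name (orcl), so adjust it if your environment differs, and put
$ORACLE_HOME/jdbc/lib/ojdbc6.jar on the classpath when compiling and running it:
import java.sql.*;

public class QueryExternalNoSQL
{
    public static void main(String[] args) throws SQLException
    {
        // Assumed connection details: thin driver, local listener on 1521, service orcl,
        // BDA user with the password used earlier in this exercise (welcome1)
        String url = "jdbc:oracle:thin:@//localhost:1521/orcl";
        try (Connection conn = DriverManager.getConnection(url, "BDA", "welcome1");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT name, type, text FROM external_nosql"))
        {
            while (rs.next())
            {
                System.out.println(rs.getString(1) + " | " + rs.getString(2) + " | " + rs.getString(3));
            }
        }
    }
}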
Task 5: Load Data into NoSQL
1. Now that we have an idea about how we can interface with the NoSQL database, let's proceed to process
the data that we need for our analysis. First, we will load some data into the NoSQL Database. Normally,
the transaction-related data that we are about to process should be loaded in a real-time manner from the
production environment. In our case, we will perform the load ourselves, as our data is stored in a CSV
file.
Before loading the data, however, let's have a look at its structure. With version 2.0 of the Oracle NoSQL
Database we have integrated AVRO serialization into our database. This has greatly increased the
flexibility of what you can store as the Value of a Key/Value pair. With AVRO you can now define an entire
schema that will be stored in the value field. To view the AVRO schema that will hold our transactional
data, go to the terminal and type:
[oracle@bigdatalite noSQL]$ gedit trans.avsc
Then press Enter
Our schema's name is trans, highlighted in yellow in the screenshot below. The entire declaration of a
field is highlighted in green. We have an integer-type field called teller, which holds the value 1 if the
transaction is of type teller or 0 if not. This field is followed by three other integer-type fields called kiosk,
atm and webbank. The same rule as the above also applies to these three fields.
Finally, we have the sum field, of type float, that holds the sum of the transaction.
2. When you are done evaluating the file click on the X in the right upper corner of the window to close it.
3. Now that we've reviewed the schema, let's attach it to the NoSQL Database. First we have to open the
NoSQL Database administration console. Go to the terminal and type:
[oracle@bigdatalite noSQL]$ java -jar $KVHOME/lib/kvstore.jar runadmin -port
5000 -host `hostname`
Then press Enter
runadmin opens the NoSQL database admin console
4. Now let's run the command that adds the schema to the NoSQL Database. Go to the terminal and type:
kv-> ddl add-schema -file trans.avsc
Added schema: trans.1
Then press Enter
The add-schema command adds the schema defined in the trans.avsc file
5. Now that we've added the schema, we can exit the NoSQL administration console. Go to the terminal and
type:
kv-> exit
Then press Enter
6. The schema is now added, so we can proceed to load the data into the NoSQL Database. Before actually
doing that, let's review the code we will use to perform this action. Go to the terminal and type:
[oracle@bigdatalite noSQL]$ gedit NoSQLTransactions.java
Then press Enter
In the code you will see how to interact with this AVRO schema. In yellow we highlighted how to configure
the interface to the schema from the code. In green is the code on how to fill data into the data structure.
In purple there is the typical code for inserting the data into the database and finally in blue is the code for
retrieving and displaying the data inside the database. Notice that the key has a major and a minor
component. The major component is comprised of two items: the string Load1 and the first column in the
csv file (the customer id). The minor component is an automatically generated string like trans1, trans2,
etc.
Only the transactions corresponding to customer id CU9988 are retrieved from the database and will be
displayed in the terminal when running this Java program.
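If the screenshot is unavailable, the pattern for writing one of these Avro values looks roughly like the sketch
below. It assumes the generic Avro binding API in the oracle.kv.avro package that ships with this release of
Oracle NoSQL Database; the exact structure of NoSQLTransactions.java may differ, and the customer id,
record name and amounts are simply taken from the program's sample output shown later in this task:
import java.io.File;
import java.util.Arrays;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import oracle.kv.*;
import oracle.kv.avro.GenericAvroBinding;

public class AvroPutExample
{
    public static void main(String[] args) throws Exception
    {
        KVStoreConfig config = new KVStoreConfig("kvstore", "localhost:5000");
        KVStore store = KVStoreFactory.getStore(config);

        // Parse the schema file and obtain a generic binding for it
        // (the schema must already be registered with "ddl add-schema", as done above)
        Schema trans = new Schema.Parser().parse(new File("trans.avsc"));
        GenericAvroBinding binding = store.getAvroCatalog().getGenericBinding(trans);

        // Fill one transaction record: an ATM transaction of 1.03
        GenericRecord record = new GenericData.Record(trans);
        record.put("teller", 0);
        record.put("kiosk", 0);
        record.put("atm", 1);
        record.put("webbank", 0);
        record.put("sum", 1.03f);

        // Key: major component ("Load1", customer id), minor component a generated name
        Key key = Key.createKey(Arrays.asList("Load1", "CU9988"), Arrays.asList("trans450"));
        store.put(key, binding.toValue(record));

        // Read the record back and print it
        GenericRecord read = binding.toObject(store.get(key).getValue());
        System.out.println(key.toString() + " - " + read);

        store.close();
    }
}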
7. When you are done evaluating the code click on the X in the right upper corner of the window to close it.
8. First, we have to compile the code. Go to the terminal and type:


[oracle@bigdatalite noSQL]$ javac NoSQLTransactions.java
Then press Enter
We are now ready to run the Java program. Go to the terminal and type:
[oracle@bigdatalite noSQL]$ java NoSQLTransactions
/Load1/CU9988/-/trans449 - {"teller": 0, "kiosk": 1, "atm": 0, "webbank": 0,
"sum": 1.36}
/Load1/CU9988/-/trans450 - {"teller": 0, "kiosk": 0, "atm": 1, "webbank": 0,
"sum": 1.03}
/Load1/CU9988/-/trans451 - {"teller": 0, "kiosk": 0, "atm": 0, "webbank": 600,
"sum": 50.68}
Then press Enter


The data is inserted into the NoSQL Database, and then the transactions corresponding to customer id
CU9988 are retrieved from the database and displayed in the terminal, as highlighted in bold.
Task 6: Aggregate using Hadoop MapReduce and store in HDFS
1. For our analysis we need an aggregated view of the transactions for each customer. Our final goal is to
load the aggregated data into the Oracle Database using the Oracle Loader for Hadoop (this will be done in
a later exercise). To aggregate the transactions, we will use a Hadoop MapReduce job that interacts with
the NoSQL data, performs the processing and stores the results into an HDFS file. Let's review the Java
code that does this. Go to the terminal and type:
[oracle@bigdatalite noSQL]$ gedit Hadoop.java
Then press Enter
The Mapper class includes a filter in order to select only the Keys in the NoSQL Database corresponding
to our customer IDs (CU*).
It then stores the 5 fields from the Value part of the NoSQL records into an array of floats.
Finally, it sends this information grouped by customer id to the Reducer through the context.write
command.

The reducer performs a sum of every component field of the Value (previously stored in the array of
floats), for each customer id.
Then, it creates the output records to be stored in the HDFS file. The teller, kiosk, atm and webbank
values are actually integers, not float values, so a conversion is performed for these four values before
writing them to the output. Only the sum (the last field) is a float value.
Finally, the records are passed forward to be written into the HDFS output file.
The resulting records will contain an aggregated view of the customers' transactions. For each customer,
we will see the number of teller, kiosk, atm and webbank transactions, along with the total sum of all
transactions. We've selected a sample of 1015 customers, so our resulting HDFS file will contain 1015
lines, each line corresponding to a customer.
The main function sets the parameters for the MapReduce job. We've highlighted a few in the screenshot
below. The line highlighted in yellow shows the separator of the fields in the output HDFS file (,).
The lines highlighted in blue show the connection to the NoSQL Database and the line highlighted in
green shows the Parent Key of the NoSQL records to be processed: Load1. We chose this string as the
first item of the Key's major component for our transactional data, when we loaded the records into the
NoSQL Database.
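To make the aggregation step concrete, here is a minimal sketch of the reduce-side logic described above. It
assumes FloatArrayWritable is a thin ArrayWritable wrapper of FloatWritable values, which may not match the
lab's Hadoop.java exactly:
import java.io.IOException;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Reducer;

public class AggregationSketch
{
    // Assumed shape of the lab's FloatArrayWritable: an ArrayWritable of FloatWritable values
    public static class FloatArrayWritable extends ArrayWritable
    {
        public FloatArrayWritable()
        {
            super(FloatWritable.class);
        }
    }

    // Sums the five per-transaction fields for one customer and emits
    // "teller,kiosk,atm,webbank,sum", with the first four converted to integers
    public static class MyReducer extends Reducer<Text, FloatArrayWritable, Text, Text>
    {
        @Override
        protected void reduce(Text customerId, Iterable<FloatArrayWritable> values, Context context)
                throws IOException, InterruptedException
        {
            float[] totals = new float[5];   // teller, kiosk, atm, webbank, sum
            for (FloatArrayWritable v : values)
            {
                Writable[] fields = v.get();
                for (int i = 0; i < totals.length; i++)
                {
                    totals[i] += ((FloatWritable) fields[i]).get();
                }
            }
            String out = (int) totals[0] + "," + (int) totals[1] + "," + (int) totals[2] + ","
                       + (int) totals[3] + "," + totals[4];
            // With the output separator set to "," the final line in HDFS becomes:
            // <customer id>,<teller>,<kiosk>,<atm>,<webbank>,<sum>
            context.write(customerId, new Text(out));
        }
    }
}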

2. When you are done evaluating the code click on the X in the right upper corner of the window to close it.
3. Let's create a new directory where we will store the class file resulting from the compilation of the Java
code. Go to the terminal and type:
[oracle@bigdatalite noSQL]$ mkdir kv_classes
Then press Enter
The following statement compiles the Java code and stores the resulting .class file in the previously
created directory. Go to the terminal and type:
[oracle@bigdatalite noSQL]$ javac -d kv_classes Hadoop.java
Note: Hadoop.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Then press Enter
[oracle@bigdatalite noSQL]$ ls kv_classes
Hadoop$MyMapper.class
Hadoop$FloatArrayWritable.class
Hadoop$MyReducer.class Hadoop.class
4. The following statement creates a JAR file containing the compiled Java code. Go to the terminal and
type:
[oracle@bigdatalite noSQL]$ jar -cvf CustomerCount.jar -C kv_classes/ .
added manifest
adding: Hadoop$MyMapper.class(in = 2135) (out= 980)(deflated 54%)
adding: Hadoop$FloatArrayWritable.class(in = 353) (out= 244)(deflated 30%)
adding: Hadoop$MyReducer.class(in = 2157) (out= 981)(deflated 54%)
adding: Hadoop.class(in = 2141) (out= 1157)(deflated 45%)
Then press Enter
5. We are now ready to execute the JAR. The output of the Map-Reduce job will be a new directory in HDFS
called loader, as evidenced by the last parameter of the following command. Go to the terminal and type:
[oracle@bigdatalite noSQL]$ hadoop jar CustomerCount.jar Hadoop -libjars
$KVHOME/lib/kvclient.jar loader
15/07/31 18:21:19 INFO client.RMProxy: Connecting to ResourceManager at
localhost/127.0.0.1:8032
15/07/31 18:21:21 INFO mapreduce.JobSubmitter: number of splits:1
15/07/31 18:21:21 INFO Configuration.deprecation:
mapred.textoutputformat.separator is deprecated. Instead, use
mapreduce.output.textoutputformat.separator
15/07/31 18:21:22 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_1436834454401_0019
15/07/31 18:21:22 INFO impl.YarnClientImpl: Submitted application
application_1436834454401_0019
15/07/31 18:21:23 INFO mapreduce.Job: The url to track the job:
http://bigdatalite.localdomain:8088/proxy/application_1436834454401_0019/
15/07/31 18:21:23 INFO mapreduce.Job: Running job: job_1436834454401_0019
15/07/31 18:21:23 INFO mapreduce.Job: The url to track the job:
http://bigdatalite.localdomain:8088/proxy/application_1436834454401_0019/
15/07/31 18:21:23 INFO mapreduce.Job: Running job: job_1436834454401_0019
15/07/31 18:21:37 INFO mapreduce.Job: Job job_1436834454401_0019 running in
uber mode : false
15/07/31 18:21:37 INFO mapreduce.Job: map 0% reduce 0%
15/07/31 18:21:49 INFO mapreduce.Job: map 100% reduce 0%
15/07/31 18:21:58 INFO mapreduce.Job: map 100% reduce 100%


15/07/31 18:21:58 INFO mapreduce.Job: Job job_1436834454401_0019 completed
successfully
15/07/31 18:21:59 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=250721
FILE: Number of bytes written=713893
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=230
HDFS: Number of bytes written=25357
HDFS: Number of read operations=5
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=9350
Total time spent by all reduces in occupied slots (ms)=6679
Total time spent by all map tasks (ms)=9350
Total time spent by all reduce tasks (ms)=6679
Total vcore-seconds taken by all map tasks=9350
Total vcore-seconds taken by all reduce tasks=6679
Total megabyte-seconds taken by all map tasks=2393600
Total megabyte-seconds taken by all reduce tasks=1709824
Map-Reduce Framework
Map input records=7533
Map output records=7533
Map output bytes=235649
Map output materialized bytes=250721
Input split bytes=230
Combine input records=0
Combine output records=0
Reduce input groups=1015
Reduce shuffle bytes=250721
Reduce input records=7533
Reduce output records=1015
Spilled Records=15066
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=316
CPU time spent (ms)=6800
Physical memory (bytes) snapshot=445726720
Virtual memory (bytes) snapshot=1765851136
Total committed heap usage (bytes)=318767104
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=25357
Then press Enter


6. Once the job is finished, we can view the results. Let's see the content of the part-r-00000 file from the loader
HDFS directory. Go to the terminal and type:

[oracle@bigdatalite noSQL]$ hadoop fs -cat loader/part-r-00000


CU100,0,0,3,2000,53.36
CU10006,1,1,0,2000,53.06
CU10011,6,1,4,7250,53.8
CU10012,0,1,0,1001,53.059998
CU10020,1,1,2,2500,53.11
CU10025,1,1,0,1840,53.059998
CU10041,2,8,1,700,53.069996
CU10044,1,1,5,5000,54.33
CU1005,1,0,5,1200,54.21
CU10110,3,1,5,1450,54.49
We now have an aggregated view of the customers' transactions stored in an HDFS file, which we will
later use to load the results into the Oracle Database.
7. This exercise is now finished. Go to the terminal and type:
exit
Then press Enter
Lab Summary
Oracle NoSQL Database provides a powerful and flexible transaction model that greatly simplifies the process of
developing a NoSQL-based application. It scales horizontally with high availability and transparent load balancing
even when dynamically adding new capacity. By choosing Oracle for both their NoSQL and RDBMS databases,
customers benefit from the integration between these databases, the ability to run Oracle Enterprise Manager as
a single console to manage across the databases, and a single point of customer support.
In this exercise, you were introduced to the Oracle NoSQL Database. You saw how to insert and retrieve key/value
pairs as well as the storeIterator function where multiple values could be retrieved under the same major
component of a key. You also saw how NoSQL data can be accessed through external tables from the Oracle
Database, how to interact with data stored in AVRO schemas and how NoSQL data can be processed by using
Hadoop MapReduce jobs.
Do not proceed to the next section until you have clearance from your instructor.
Lab F: Oracle Loader for Hadoop
Objective: Load data from HDFS into the Oracle Database
30 minutes
This exercise involves loading our transactional data, initially stored in NoSQL and now residing in an
aggregated format within an HDFS file, into the Oracle Database, using the Oracle Loader for Hadoop.
In this exercise you will:
1. Create an Oracle Database table
2. Load the data from an HDFS file into an Oracle Database table
3. View the results
Task 1: Load HDFS data into the Oracle Database
1. All of the setup and execution for this exercise can be done from the terminal, hence open a terminal by
double clicking on the Terminal icon on the desktop.
2. To get into the folder where the scripts for this exercise are, type in the terminal:
[oracle@bigdatalite ~]$ cd /home/oracle/exercises/OLH
[oracle@bigdatalite OLH]$ ls
cheat.sh  cleanup.sh  createTable.sh  hdfsConfig.xml  hdfsMap.xml  loadHDFS.sh  viewData.sh
Then press Enter
3. First we will create the table that will hold our transactional data. Let's review the script that creates the
table. Go to the terminal and type:
[oracle@bigdatalite OLH]$ gedit createTable.sh
Then press Enter
A new table called LOADER_NOSQL is created in the BDA schema. It will hold the customer_id, the
number of transactions by type (teller, kiosk, atm and web bank) and the sums of the transactions. Please
notice how the structure of the table closely resembles the structure of the HDFS file that we created
when running the Map-Reduce job on NoSQL data.
4. When you are done evaluating the table creation script, click on the X in the right upper corner of the
window to close it.
5. Let's proceed to run the script. Go to the terminal and type:
[oracle@bigdatalite OLH]$ ./createTable.sh
ORACLE_SID = [orcl] ? The Oracle base remains unchanged with value
/u01/app/oracle
SQL*Plus: Release 12.1.0.2.0 Production on Fri Jul 31 17:50:27 2015
Copyright (c) 1982, 2014, Oracle.

All rights reserved.

Last Successful login time: Fri Jul 31 2015 17:50:08 -04:00


Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing
options
SQL> SQL>
2
Table created.

SQL> Disconnected from Oracle Database 12c Enterprise Edition Release


12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing
options
Then press Enter
The Oracle Loader for Hadoop makes use of two configuration files that have been pre-built for this
exercise, hdfsConfig.xml and hdfsMap.xml, which contain the information necessary to load the data
from the HDFS file into the Oracle Database. Let's review hdfsConfig.xml. Go to the terminal and type:
[oracle@bigdatalite OLH]$ gedit hdfsConfig.xml
Then press Enter
Notice the InputFormat class used by the Oracle Loader for Hadoop: DelimitedTextInputFormat, as the
input for the Oracle Loader for Hadoop is a comma delimited text file. The input.dir property specifies the
directory where the input file resides in HDFS, while the output.dir directory specifies the path in HDFS
where the Oracle Loader for Hadoop will store its output. Please notice that the input directory is
/user/oracle/loader/, which is the output directory of the Hadoop Map/Reduce job that aggregated the
NoSQL records in a previous exercise. The path to the previously mentioned hdfsMap.xml file is also
specified. We will review this file in a later step.
If you scroll down you can find other useful information, like the character used to delimit the fields in the
text file (comma), the Oracle Database connection URL, the user schema name and password. Again, the
password could be stored within Oracle Wallet to provide secure authentication, which is more likely in
production environments. The last property shown in the screenshot below specifies the names of the
fields in the input text file. We need these names to perform a mapping between them and the Oracle
Database columns, through the hdfsMap.xml file.
6. When you are done evaluating the configuration file, click on the X in the right upper corner of the window
to close it.
7. Let's review the hdfsMap.xml file. Go to the terminal and type:
[oracle@bigdatalite OLH]$ gedit hdfsMap.xml
Then press Enter
The file specifies the mapping between the fields in the text file and the Oracle Database table columns.
8. When you are done evaluating the file, click on the X in the right upper corner of the window to close it.
9. Before running OLH, let's make sure the output HDFS directory doesn't exist from a previous run of the
exercise. To do that, we will issue a command that deletes it. Go to the terminal and type:
[oracle@bigdatalite OLH]$ hadoop fs -rm -r /user/oracle/output3
rm: `/user/oracle/output3': No such file or directory
Then press Enter
Note: You may receive an error stating that the directory doesn't exist. You can ignore it.
The hadoop fs -rm -r command is similar to the Unix rm -r command for removing a directory.
Task 2: Load the data from HDFS -> Oracle DB


1. We are now ready to run the Oracle Loader for Hadoop. Go to the terminal and type:
[oracle@bigdatalite OLH]$ hadoop jar $OLH_HOME/jlib/oraloader.jar
oracle.hadoop.loader.OraLoader -conf hdfsConfig.xml
Oracle Loader for Hadoop Release 3.3.0 - Production
Copyright (c) 2011, 2014, Oracle and/or its affiliates. All rights reserved.
15/07/31 18:28:40 INFO loader.OraLoader: Oracle Loader for Hadoop Release
3.3.0 - Production
Copyright (c) 2011, 2014, Oracle and/or its affiliates. All rights reserved.
15/07/31 18:28:40 INFO loader.OraLoader: Built-Against: hadoop-2.2.0 hive0.13.0 avro-1.7.3 jackson-1.8.8
15/07/31 18:28:40 INFO Configuration.deprecation: mapreduce.outputformat.class
is deprecated. Instead, use mapreduce.job.outputformat.class
15/07/31 18:28:40 INFO Configuration.deprecation: mapred.output.dir is
deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
15/07/31 18:28:40 INFO utils.OraLoaderConf: property
"oracle.hadoop.loader.loaderMapFile" is deprecated; use
"oracle.hadoop.loader.loaderMap.*" instead
Job Counters
Launched map tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=8482
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=8482
Total vcore-seconds taken by all map tasks=8482
Total megabyte-seconds taken by all map tasks=2171392
Map-Reduce Framework
Map input records=1015
Map output records=1015
Input split bytes=132
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0


GC time elapsed (ms)=125
CPU time spent (ms)=4330
Physical memory (bytes) snapshot=189841408
Virtual memory (bytes) snapshot=894828544
Total committed heap usage (bytes)=116916224
File Input Format Counters
Bytes Read=25357
File Output Format Counters
Bytes Written=1632
15/07/31 18:29:23 INFO Configuration.deprecation: mapred.output.dir is
deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir

Then press Enter


In the output of the Map-Reduce job launched by the Oracle Loader for Hadoop we can see that 1015
rows were processed.
We can see from the job counters that the map tasks completed successfully.
Task 3: View the results


1. Let's view the results. We will query the previously created Oracle Database table to see if the rows
were successfully imported from the HDFS file. We have a script that queries the database table.
Let's review it. Go to the terminal and type:
[oracle@bigdatalite OLH]$ gedit viewData.sh
Then press Enter
The script connects to sqlplus with the BDA user and retrieves every column from the LOADER_NOSQL
table of the BDA schema.
2. When you are done evaluating the script, click on the X in the right upper corner of the window to close it.
3. Time to run the script that queries the LOADER_NOSQL table. Go to the terminal and type:
[oracle@bigdatalite OLH]$ ./viewData.sh
CUSTOMER_ID   N_TRANS_TELLER  N_TRANS_KIOSK  N_TRANS_ATM  N_TRANS_WEB_BANK  MONEY_MONTLY_OVERDRAWN
CU9932                     3              1            3                 0                   53.38
CU9964                                                               1100                    53.83
CU998                                                                2500                53.420002
CU9988                     0              1            1               600                   53.07
1015 rows selected.
SQL> Disconnected from Oracle Database 12c Enterprise Edition Release
12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing
options
Then press Enter
The 1015 customer-related rows originating in the NoSQL Database are now present in the Oracle
Database.
4. This exercise is now finished, so you can close the terminal window. Go to the terminal and type:
[oracle@bigdatalite OLH]$ exit
Then press Enter
Lab Summary
In this chapter we've shown how to load data from delimited text files residing in HDFS into Oracle Database
tables, using the Oracle Loader for Hadoop. Please note that the Oracle Loader for Hadoop can also be used to
read data from Oracle NoSQL Databases directly.
To use the loader, one would usually go through the following steps:
1) Select one of the predefined InputFormats, or build your own
2) Create the configuration files
3) Invoke the Oracle Loader for Hadoop
This way, Oracle users can have easy access to HDFS/Hive/NoSQL data, using the tools they know and taking
advantage of the powerful capabilities provided by the Oracle Database.
Do not proceed to the next section until you have clearance from your
instructor.
Lab G: Oracle SQL Connector for HDFS
Objective: Connect the HDFS file to the Oracle Database using External Tables
30 minutes
This exercise will involve working with Oracle external tables. We need the data related to monthly cash accounts,
previously processed in Pig, to be available inside the Oracle Database. Therefore, the HDFS file that holds this
data will be connected to the Oracle Database using external tables. To do this, we will use the Oracle SQL
Connector for HDFS.
In this exercise you will:
1. Create an external table in the Oracle Database
2. Query the external table which stores data in HDFS
Note: To successfully go through this exercise, you have to go through the Pig Exercise first (see Chapter 5).
Task 1: Configure External Tables stored in HDFS
1. All of the setup and execution for this exercise can be done from the terminal, hence open a terminal by
double clicking on the Terminal icon on the desktop.
2. To get into the folder where the scripts for the external table exercise are, type in the terminal:
[oracle@bigdatalite OLH]$ cd /home/oracle/exercises/OSCH
[oracle@bigdatalite OSCH]$ ls
cheat.sh  cleanup.sh  connectHDFS.sh  etc  hdfs_config.xml  viewData.sh
Then press Enter


3. Let's look at the HDFS file, which contains the monthly cash account situation for our sample customers.
Go to the terminal and type:
[oracle@bigdatalite OSCH]$ hadoop fs -cat customer_averages/part-r-00000
CU793,25.0,24600.0,4.0,14975.0
CU790,25.0,16141.0,2.0,560.0
CU726,25.0,23000.0,1.0,30329.0
CU685,25.0,2401.0,4.0,3702.0
CU544,25.0,0.0,0.0,503.0
CU513,25.0,0.0,0.0,504.0
CU491,25.0,3686.0,1.0,958.0
CU475,25.0,0.0,2.0,504.0
CU396,25.0,9000.0,3.0,2680.0
CU339,25.0,17001.0,14.0,21264.0
CU293,25.0,3200.0,2.0,2826.0
CU225,25.0,3100.0,2.0,1046.0
CU216,25.0,2201.0,4.0,55686.0
CU197,25.0,0.0,0.0,598.0
CU161,25.0,6500.0,4.0,2220.0
CU129,25.0,0.0,1.0,713.0
CU27,25.0,0.0,0.0,506.0
CU2,25.0,0.0,0.0,542.0
The data in this file has been extracted from our internal systems and processed and aggregated in Pig,
in order to provide a view of the monthly cash accounts for our sample customers. This text file
consists of five columns, delimited by commas. The first column represents the customer id, followed by
the monthly checking amount, bank funds, number of checks written and the total amount corresponding
to automatic payments.
Task 2: Point external table to data in the HDFS file


1. The Oracle SQL Connector for HDFS will create an external table in the Oracle Database that points to
data in the HDFS file. OSCH requires an XML configuration file, which has been prebuilt for this
exercise (hdfs_config.xml). Let's review the configuration file. Go to the terminal and type:
[oracle@bigdatalite OSCH]$ gedit hdfs_config.xml
Then press Enter
Notice the external table name, the path to the files in HDFS where the actual data is, the column
separator used in the files to delimit the fields and the Oracle Database directory that the Oracle SQL
Connector for HDFS uses.
If you scroll down you can see details like the Oracle Database connection URL, the user schema name
and the names of the external table columns. The Oracle SQL Connector for HDFS automatically
generates the SQL command that creates the external table using this configuration file. We will see the
generated SQL command later in the exercise.
2. When you are done evaluating the configuration file, click on the X in the right upper corner of the window
to close it.
We are now ready to run the Oracle SQL Connector for HDFS, to connect the Oracle Database external
table to the HDFS file. Go to the terminal and type:
[oracle@bigdatalite OSCH]$ hadoop jar $OSCH_HOME/jlib/orahdfs.jar
oracle.hadoop.exttab.ExternalTable -conf hdfs_config.xml -createTable
Then press Enter
3. You will be asked to enter a password so that the connector can log in to the database user. Enter the
following information:
[Enter Database Password:]: welcome1
Then Press Enter
NOTE: No text will appear while you type
As the execution of the Oracle SQL Connector for HDFS finishes, the external table creation code is
displayed in the terminal. Looking at the code for creating the table, you will notice syntax very similar to
that of other external tables, except for two lines: the preprocessor and the type, highlighted in the image
below. The table creation code is automatically generated by the Oracle SQL Connector for HDFS, based
on the XML configuration file.
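For reference, the generated statement typically looks something like the sketch below. The column names,
database directory and location file name shown here are illustrative only - your generated code will differ -
but the PREPROCESSOR and TYPE lines are the parts that distinguish it from an ordinary external table:

SQL> CREATE TABLE "BDA"."EXTERNAL_HDFS"
     (
       "CUSTOMER_ID"   VARCHAR2(4000),
       "CHECKING_AMT"  VARCHAR2(4000),
       "BANK_FUNDS"    VARCHAR2(4000),
       "NUM_CHECKS"    VARCHAR2(4000),
       "AUTO_PAYMENTS" VARCHAR2(4000)
     )
     ORGANIZATION EXTERNAL
     (
       TYPE ORACLE_LOADER                            -- standard external table access driver
       DEFAULT DIRECTORY "MOVIEDEMO_DIR"             -- illustrative database directory name
       ACCESS PARAMETERS
       (
         RECORDS DELIMITED BY NEWLINE
         PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'
         FIELDS TERMINATED BY ','
         MISSING FIELD VALUES ARE NULL
       )
       LOCATION ( 'osch-20160101010101-0000-1' )     -- location file generated by OSCH
     )
     REJECT LIMIT UNLIMITED;

The hdfs_stream preprocessor is the piece that streams the HDFS blocks into the database at query time;
the location file is a small metadata file written by the connector, not the data itself.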


Task 3: Use SQL to read data from mapped external table


1. We can now use SQL from the Oracle Database to read the HDFS file containing the monthly cash
accounts data. We have a script that queries the external table. Let's review it. Go to the terminal and
type:
[oracle@bigdatalite OSCH]$ gedit viewData.sh
Then press Enter
The script connects to the Oracle Database with the BDA user using SQL*Plus. Once connected, it will
query the external table called EXTERNAL_HDFS that points to the data from the HDFS file.

2. When you are done evaluating the script, click the X in the right upper corner of the window to close it.


3. Time to run the script that queries the table. Go to the terminal and type:
[oracle@bigdatalite OSCH]$ ./viewData.sh
Then press Enter
Notice how the customer rows stored in HDFS were retrieved through the Oracle Database. The monthly
cash accounts data is now available to be queried from within the Oracle Database.
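Because EXTERNAL_HDFS is now just another table as far as the database is concerned, any SQL can be
applied to it. For example, while still connected as the BDA user in SQL*Plus, a quick optional sanity check is:

SQL> SELECT COUNT(*) FROM external_hdfs;

The count should match the number of rows in the HDFS file produced by the Pig job.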

4. This exercise is now finished, so we can close the terminal window. Go to the terminal and type:
[oracle@bigdatalite OSCH]$ exit
Then press Enter
Lab Summary
In this chapter we showed how data in HDFS can be queried using standard SQL right from the Oracle Database.
Data from Data Pump files in HDFS, or from non-partitioned Hive tables over delimited text files, can also be
queried through SQL via external tables created by OSCH, right from the Oracle Database. Note: Oracle NoSQL
Database provides similar capabilities for querying its data from the Oracle Database; however, those capabilities
are built into Oracle NoSQL Database directly and are not part of the Oracle SQL Connector for HDFS.
With the data stored in HDFS, all of the parallelism and striping that naturally occurs there is taken full advantage
of, while at the same time you can use all of the power and functionality of the Oracle Database. Parallel
processing is extremely important with this access method, so when querying these external tables consider
enabling parallel query with this SQL command:
SQL> ALTER SESSION ENABLE PARALLEL QUERY;
Before creating new database tables from the external tables created by Oracle SQL Connector for HDFS (for
example with CREATE TABLE ... AS SELECT), enable parallel DDL with this SQL command:
SQL> ALTER SESSION ENABLE PARALLEL DDL;
Before inserting data into an existing database table, enable parallel DML with this SQL command:
SQL> ALTER SESSION ENABLE PARALLEL DML;
Hints such as APPEND and PQ_DISTRIBUTE also improve performance when inserting data.
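For example, a parallel, direct-path load from the connector's external table into a permanent table might look
like the following sketch (the target table monthly_cash_accounts is illustrative and would need to be created
with a matching column list first):

SQL> ALTER SESSION ENABLE PARALLEL DML;
SQL> INSERT /*+ APPEND */ INTO monthly_cash_accounts
     SELECT * FROM external_hdfs;
SQL> COMMIT;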

Do not proceed to the next section until you have clearance from your
instructor.


Lab H: Big Data SQL


Objective: Use Oracle Big Data SQL to query data in Hadoop and
Hive

! 30 minutes

Task 1: Understand the Big Data SQL Configuration


1. Launch a Terminal window using the Desktop toolbar.

2. In the Terminal window, change to the Common directory location, and then view the contents of the
bigdata.properties file. Enter the following commands at the prompt:
[oracle@bigdatalite /]$ cd /u01/bigdatasql_config
[oracle@bigdatalite bigdatasql_config]$ cat bigdata.properties
java.libjvm.file=/usr/java/latest/jre/lib/amd64/server/libjvm.so
java.classpath.oracle=/u01/nosql/kv-ee/lib/kvclient.jar:/u01/app/oracle/product/12.1.0.2/dbhome_1/jlib/oracle-hadoop-sql.jar:/u01/app/oracle/product/12.1.0.2/dbhome_1/jlib/ora-hadoop-common.jar:/u01/app/oracle/product/12.1.0.2/dbhome_1/jlib/oraloader.jar:/u01/app/oracle/product/12.1.0.2/dbhome_1/jlib/HiveMetadata.jar:/u01/app/oracle/product/12.1.0.2/dbhome_1/jdbc/lib/ojdbc6.jar
java.classpath.hadoop=/usr/lib/hadoop/client/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-mapreduce/lib/*
java.classpath.hive=/usr/lib/hive/lib/*:/u01/bigdatasql_config/hive_aux_jars/hive-hcatalog-core.jar
LD_LIBRARY_PATH=/usr/java/latest/jre/lib/amd64/server/:/usr/lib/hadoop/lib/native/
bigdata.cluster.default=bigdatalite

The properties that are not specific to a Hadoop cluster include items such as the location of the Java
VM, the classpaths and the LD_LIBRARY_PATH.
In addition, the last line of the file specifies the default cluster property - in this case, bigdatalite. As you will
see later, the default cluster simplifies the definition of Oracle tables that access data in Hadoop.
In our hands-on lab, there is a single cluster: bigdatalite. The bigdatalite subdirectory contains the
configuration files for the bigdatalite cluster.
The name of the cluster must match the name of the subdirectory (and it is case sensitive!).
Next, let's review the contents of the Cluster Directory.
3. Using the Terminal window, change to the Cluster directory and view its contents by executing the
following commands at the prompt:
[oracle@bigdatalite bigdatasql_config]$ cd /u01/bigdatasql_config/bigdatalite
[oracle@bigdatalite bigdatalite]$ ls
core-site.xml hdfs-site.xml hive-env.sh hive-site.xml mapred-site.xml

These are the files required to connect Oracle Database to HDFS and to Hive.


Task 2: Create the Corresponding Oracle Directory Objects


Now, you will create the Oracle directory objects that correspond to these file system directories.
The Oracle directory objects follow a specific naming convention:
ORACLE_BIGDATA_CONFIG - the Oracle directory object that references the Common Directory
(/u01/bigdatasql_config)
ORACLE_BIGDATA_CL_bigdatalite - the Oracle directory object that references the Cluster Directory
(/u01/bigdatasql_config/bigdatalite)
The cluster portion of the Oracle directory object name must match the physical directory name in the file
system, and it is case sensitive.
To reduce typing errors, feel free to open cheat.sh using gedit. Don't forget to go through each command
to understand what is being done.
These objects should already exist, but if they don't, let's create them. Ignore any "object already exists" errors.
1. Login to the Oracle database as the system user. Set the oracle environment first, using the oraenv
command, before logging in.
[oracle@bigdatalite bigdatalite]$ . oraenv
ORACLE_SID = [cdb] ? [press ENTER]
The Oracle base remains unchanged with value /u01/app/oracle
[oracle@bigdatalite bigdatalite]$ sqlplus system@orcl
SQL*Plus: Release 12.1.0.2.0 Production on Tue Sep 8 00:59:27 2015
Copyright (c) 1982, 2014, Oracle. All rights reserved.

Enter password: welcome1


Last Successful login time: Wed Sep 02 2015 21:58:48 -04:00
Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing
options
2. Create the directories if they dont already exist.
SQL> create or replace directory ORACLE_BIGDATA_CONFIG as
'/u01/bigdatasql_config';
Directory created.
SQL> create or replace directory "ORA_BIGDATA_CL_bigdatalite" as '';
Directory created.
SQL> grant read, write, execute on DIRECTORY ORACLE_BIGDATA_CONFIG to moviedemo;
Grant succeeded.
SQL> exit;


Use the cluster name as identified by the Oracle directory object. Note that no location is specified for the
Cluster Directory: it is expected to be a subdirectory of ORACLE_BIGDATA_CONFIG.
Recommended Practice: In addition to the Oracle directory objects, you should also create the Big Data
SQL Multi-Threaded Agent (MTA).
Multi-Threaded Agent
The MTA is already set up in this environment, because it is associated with a preconfigured cluster.
This agent bridges the metadata between Oracle Database and Hadoop. Technically, the MTA allows the
external process to be multi-threaded - instead of launching a JVM for every process (which can be quite
slow).
If the MTA were not already set up, you would use the following commands to create it:

create public database link BDSQL$_bigdatalite using 'extproc_connection_data';
create public database link BDSQL$_DEFAULT_CLUSTER using 'extproc_connection_data';
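If you would like to confirm that these links exist in the VM, an optional check from a SQL*Plus session with
DBA privileges (for example the system user) is:

SQL> SELECT db_link FROM dba_db_links WHERE db_link LIKE 'BDSQL$%';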
Next, you will create Oracle tables that access data in HDFS.
Task 3: Create Oracle Table over Application Log in HDFS
You will create an Oracle table over data stored in HDFS and then query that data. This example will use the
ORACLE_HDFS driver; it will not leverage metadata stored in the Hive Catalog.
In this example, there was a movie application that streamed data into the Hadoop file system:
/user/oracle/moviework/applog_json. Let's review that log data:
1. Open a terminal window.
2. Execute the following command to look at the log file stored in HDFS:
[oracle@bigdatalite bigdatalite]$ hadoop fs -ls /user/oracle/moviework/applog_json
Found 1 items
-rw-r--r--   3 oracle oinstall   32557883 2014-12-30 11:25 /user/oracle/moviework/applog_json/movieapp_log_json.log

3. View the last few lines in the file, by executing the following command:
[oracle@bigdatalite bigdatalite]$ hadoop fs -tail /user/oracle/moviework/applog_json/movieapp_log_json.log
,"recommended":"Y","activity":2}
{"custid":1135508,"movieid":240,"genreid":8,"time":"2012-10-01:02:04:15","recommended":"Y","activity":5}
{"custid":1135508,"movieid":1092,"genreid":20,"time":"2012-10-01:02:10:23","recommended":"N","activity":5}
{"custid":1135508,"movieid":4638,"genreid":8,"time":"2012-10-01:02:10:54","recommended":"N","activity":7}
{"custid":1135508,"movieid":4638,"genreid":8,"time":"2012-10-01:02:16:49","recommended":"N","activity":7}
{"custid":1135508,"movieid":null,"genreid":null,"time":"2012-10-01:02:24:00","recommended":null,"activity":9}
{"custid":1135508,"movieid":240,"genreid":8,"time":"2012-10-01:02:31:12","recommended":"Y","activity":11,"price":2.99}
{"custid":1191532,"movieid":59440,"genreid":7,"time":"2012-10-01:03:11:35","recommended":"Y","activity":2}
{"custid":1191532,"movieid":null,"genreid":null,"time":"2012-10-01:03:15:29","recommended":null,"activity":9}
{"custid":1191532,"movieid":59440,"genreid":7,"time":"2012-10-01:03:19:24","recommended":"Y","activity":11,"price":3.99}
The file contains every click that has taken place on this web site. The JSON (JavaScript Object Notation)
log captures the following information about each interaction:
custid: the customer accessing the site
movieid: the movie that the user clicked on
genreid: the genre that the movie belongs to
time: when the activity occurred
recommended: did the customer click on a recommended movie?
activity: a code for the various activities that can take place, including log in/out, view a movie, purchase
a movie, show movie listings, etc.
price: the price of a purchased movie
For example, in the last record above, customer 1191532 accessed the movie with an ID of 59440, belonging
to genre #7. She clicked on a recommended movie and purchased it for $3.99.
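The lab creates its Oracle tables over this data via Hive in the next tasks, but for reference, a minimal
ORACLE_HDFS external table over this directory would look roughly like the sketch below. The table name is
illustrative, DEFAULT_DIR is assumed to exist (as in the later examples), and the access parameters that
control field parsing are left at their defaults, so the raw lines may need further tuning to parse exactly as you
want:

SQL> CREATE TABLE movielog_hdfs (click VARCHAR2(4000))
     ORGANIZATION EXTERNAL
     (
       TYPE ORACLE_HDFS
       DEFAULT DIRECTORY DEFAULT_DIR
       LOCATION ( '/user/oracle/moviework/applog_json' )
     )
     REJECT LIMIT UNLIMITED;

With ORACLE_HDFS, the LOCATION clause lists HDFS paths directly and the Hive Metastore is not consulted.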

Task 4: Review Tables Stored in Hive


Hive enables SQL access to data stored in Hadoop and NoSQL stores. There are two parts to Hive: the Hive
execution engine and the Hive Metastore.
The Hive execution engine launches MapReduce job(s) based on the SQL that has been issued. Importantly, no
coding (Java, Pig, etc.) is required. The SQL supported by Hive is still limited (roughly SQL-92), but improvements
are being made over time.
The Hive Metastore has become the standard metadata repository for data stored in Hadoop. It contains the
definitions of tables (table name, columns and data types), the location of data files (e.g. directory in HDFS), and
the routines required to parse that data (e.g. StorageHandlers, InputFormats and SerDes). There are many query
execution engines that use the Hive Metastore while bypassing the Hive execution engine. Oracle Big Data SQL
is an example of such an engine. This means that the same metadata can be shared across multiple products
(e.g. Hive, Oracle Big Data SQL, Impala, Pig, Stinger, etc.); you will see an example of this in action in the
following exercises.
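Once you are at the hive> prompt (next step), one quick way to see everything the metastore holds for a
table - its columns, data location, InputFormat and SerDe - is Hive's DESCRIBE FORMATTED command, for
example:

hive> DESCRIBE FORMATTED movieapp_log_json;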
Let's begin by reviewing the tables that have been defined in Hive. After reviewing these hive definitions, we'll
create tables in the Oracle Database that will query the underlying Hive data stored in HDFS:
Tables in Hive are organized into databases. In our example, several tables have been created in the default
database. Connect to Hive and investigate these tables.
1. Open a terminal window and execute the following command at the command prompt:
[oracle@bigdatalite bigdatalite]$ hive
15/09/08 01:40:19 WARN conf.HiveConf: DEPRECATED: Configuration property
hive.metastore.local no longer has any effect. Make sure to provide a valid
value for hive.metastore.uris if you are connecting to a remote metastore.

Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.13.1-cdh5.3.0.jar!/hive-log4j.properties


This command opens the Hive command line interface (CLI). You can ignore the warning messages generated by
the CLI.
2. At the hive> prompt, enter the following command to display the list of tables in the default database:
hive> show tables;
OK
cust
movie
movie_rating
movie_updates
movie_view
movieapp_log_avro
movieapp_log_json
movieapp_log_odistage
movieapp_log_stage
movielog
session_stats
user_movie
Time taken: 1.127 seconds, Fetched: 12 row(s)

As shown in the output, several tables have been defined in the database. There are tables defined over Avro
data, JSON data and tab delimited text files.
WHAT IS AVRO?
Apache Avro is a very popular serialization format in the Hadoop
technology stack. It uses JSON to define a data structure's schema
and defines a data format designed to support data-intensive
applications. A sample record is shown to the right.

3. The first table is very simple and is equivalent to the external table that was defined in Oracle Database
in the previous exercise. Review the definition of the table by executing the following command at the hive
prompt:
hive> show create table movielog;
OK
CREATE EXTERNAL TABLE `movielog`(
  `click` string)
ROW FORMAT DELIMITED
  LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://bigdatalite.localdomain:8020/user/oracle/moviework/applog_json'
TBLPROPERTIES (
  'transient_lastDdlTime'='1410630461')
Time taken: 0.569 seconds, Fetched: 12 row(s)

The DDL for the table is displayed.
There is a single string column called click - and the table is referring to data stored in the
/user/oracle/moviework/applog_json folder.
There is no special processing of the JSON data; i.e. no routine is transforming the attributes into
columns. The table is simply displaying the JSON as a line of text.
4. Next, query the data in the movielog table by executing the following command:
hive> select * from movielog limit 10;
OK
{"custid":1185972,"movieid":null,"genreid":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8}
{"custid":1354924,"movieid":1948,"genreid":9,"time":"2012-07-01:00:00:22","recommended":"N","activity":7}
{"custid":1083711,"movieid":null,"genreid":null,"time":"2012-07-01:00:00:26","recommended":null,"activity":9}
{"custid":1234182,"movieid":11547,"genreid":6,"time":"2012-07-01:00:00:32","recommended":"Y","activity":7}
{"custid":1010220,"movieid":11547,"genreid":6,"time":"2012-07-01:00:00:42","recommended":"Y","activity":6}
{"custid":1143971,"movieid":null,"genreid":null,"time":"2012-07-01:00:00:43","recommended":null,"activity":8}
{"custid":1253676,"movieid":null,"genreid":null,"time":"2012-07-01:00:00:50","recommended":null,"activity":9}
{"custid":1351777,"movieid":608,"genreid":6,"time":"2012-07-01:00:01:03","recommended":"N","activity":7}
{"custid":1143971,"movieid":null,"genreid":null,"time":"2012-07-01:00:01:07","recommended":null,"activity":9}
{"custid":1363545,"movieid":27205,"genreid":9,"time":"2012-07-01:00:01:18","recommended":"Y","activity":7}
Time taken: 1.835 seconds, Fetched: 10 row(s)

Because there are no columns in the select list and no filters applied, the query simply scans the file and
returns the results.
No MapReduce job is executed.
There are more useful ways to query the JSON data. The next steps will show how Hive can parse the
JSON data using a serializer/deserializer - or SerDe.
5. The second table queries that same file - however this time it is using a SerDe that will translate the
attributes into columns. Review the definition of the table by executing the following command:
hive> show create table movieapp_log_json;
OK
CREATE EXTERNAL TABLE `movieapp_log_json`(
  `custid` int COMMENT 'from deserializer',
  `movieid` int COMMENT 'from deserializer',
  `genreid` int COMMENT 'from deserializer',
  `time` string COMMENT 'from deserializer',
  `recommended` string COMMENT 'from deserializer',
  `activity` int COMMENT 'from deserializer',
  `rating` int COMMENT 'from deserializer',
  `price` float COMMENT 'from deserializer',
  `position` int COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://bigdatalite.localdomain:8020/user/oracle/moviework/applog_json'
TBLPROPERTIES (
  'transient_lastDdlTime'='1410635962')
The DDL for the second table is shown.
There are columns defined for each field in the JSON document - making it much easier to understand and
query the data.
A Java class, org.apache.hive.hcatalog.data.JsonSerDe, is used to deserialize the JSON file.

This is also an illustration of Hadoop's schema on read paradigm; a file is stored in HDFS, but there is no schema
associated with it until that file is read. Our examples are using two different schemas to read that same data;
these schemas are encapsulated by the Hive tables movielog and movieapp_log_json.
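As an optional aside, there is even a third way to read the same bytes: you can pull individual attributes out
of the raw movielog table without any SerDe by using Hive's built-in get_json_object function, for example:

hive> SELECT get_json_object(click, '$.custid')   AS custid,
             get_json_object(click, '$.activity') AS activity
      FROM   movielog
      LIMIT  5;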
6. Execute the following query against the movieapp_log_json table to find activity where movies were
highly rated:
hive> select * from movieapp_log_json where rating > 4;
1444055    675       12   2012-09-30:17:25:01   N   1   5   NULL   NULL
1431502    94730     30   2012-09-30:18:14:41   Y   1   5   NULL   NULL
1040916    1135249   14   2012-09-30:18:43:58   Y   1   5   NULL   NULL
1015245    48988     6    2012-09-30:18:55:11   N   1   5   NULL   NULL
1201929    9913      9    2012-09-30:19:12:19   N   1   5   NULL   NULL
1171159    116       8    2012-09-30:21:18:37   Y   1   5   NULL   NULL
1094886    217       11   2012-09-30:22:44:21   N   1   5   NULL   NULL
1178337    19908     6    2012-09-30:23:27:47   N   1   5   NULL   NULL
1084372    544       6    2012-10-01:01:11:35   N   1   5   NULL   NULL
Time taken: 51.427 seconds, Fetched: 704 row(s)

This is a much better way to query and view the data than with our previous table. The data is deserialized into
nice, neat columns.
The Hive query execution engine converted this query into a MapReduce job.
The author of the query does not need to worry about the underlying implementation - Hive handles this
automatically.
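If you are curious about the MapReduce plan that Hive builds, you can optionally prefix the query with
EXPLAIN before moving on:

hive> EXPLAIN select * from movieapp_log_json where rating > 4;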
7. At the hive> prompt, execute the exit; command to close the Hive CLI.
hive> exit;


Task 5: Use Oracle Big Data SQL to Create and Query Hive Tables
Oracle Big Data SQL is able to leverage the Hive metadata when creating and querying tables. In this section,
you will create Oracle tables over two Hive tables: movieapp_log_json and movieapp_log_avro. Oracle Big Data
SQL will utilize the existing InputFormats and SerDes required to process this data.
1. Log in to SQL*Plus as the moviedemo user:
[oracle@bigdatalite bigdatalite]$ sqlplus moviedemo/welcome1@orcl
SQL*Plus: Release 12.1.0.2.0 Production on Tue Sep 8 01:47:47 2015
Copyright (c) 1982, 2014, Oracle. All rights reserved.

Last Successful login time: Tue Sep 08 2015 01:33:25 -04:00


Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing
options
SQL> CREATE TABLE movieapp_log_json (
       custid       INTEGER,
       movieid      INTEGER,
       genreid      INTEGER,
       time         VARCHAR2(20),
       recommended  VARCHAR2(4),
       activity     NUMBER,
       rating       INTEGER,
       price        NUMBER
     )
     ORGANIZATION EXTERNAL
     (
       TYPE ORACLE_HIVE
       DEFAULT DIRECTORY DEFAULT_DIR
     )
     REJECT LIMIT UNLIMITED;

Table created.
Notice the new ORACLE_HIVE access driver type. This access driver invokes Oracle Big Data
SQL at query compilation time to retrieve the metadata details from the Hive Metastore.

Why did you just create the same table, movieapp_log_json twice?
____________________________________________________________________________________
____________________________________________________________________________________
By default, it will query the metastore for a table name that matches the name of the external table:
movieapp_log_json. As you will see later, this default can be overridden using ACCESS PARAMETERS.
2. Query the table using the following select statement:
SQL> select * from movieapp_log_json where rating > 4 and rownum < 20;

    CUSTID    MOVIEID    GENREID TIME                 RECO   ACTIVITY     RATING      PRICE
---------- ---------- ---------- -------------------- ---- ---------- ---------- ----------
   1126174       1647          9 2012-07-01:00:20:11  N             1          5
   1161010      15121         25 2012-07-01:01:35:04
   1161861        752         45 2012-07-01:01:55:40

19 rows selected.
The query output is shown above.
As mentioned earlier, at query compilation time, Oracle Big Data SQL queries the Hive Metastore for all
the information required to select data. This metadata includes the location of the data and the classes
required to process the data (e.g. StorageHandlers, InputFormats and SerDes).
In this example, Oracle Big Data SQL scanned the files found in the /user/oracle/moviework/applog_json
directory and then used the Hive SerDe to parse each JSON document.
In a true Oracle Big Data Appliance environment, the input splits would be processed in parallel across
the nodes of the cluster by the Big Data SQL Server, the data would then be filtered locally using Smart
Scan, and only the filtered results (rows and columns) would be returned to Oracle Database.
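On a real cluster you can verify that offload is taking place by checking the session statistics whose names
start with 'cell XT' after running a query; on this single-node VM the values may be zero or the statistics may
not be present at all, so treat this as an optional check from a suitably privileged session:

SQL> SELECT n.name, s.value
     FROM   v$statname n, v$mystat s
     WHERE  n.statistic# = s.statistic#
     AND    n.name LIKE 'cell XT%';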
3. There is a second Hive table over the same movie log content - except the data is in Avro format - not
JSON text format. Create an Oracle table over that Avro-based Hive table using the following command:
SQL> CREATE TABLE mylogdata (
       custid       INTEGER,
       movieid      INTEGER,
       genreid      INTEGER,
       time         VARCHAR2(20),
       recommended  VARCHAR2(4),
       activity     NUMBER,
       rating       INTEGER,
       price        NUMBER
     )
     ORGANIZATION EXTERNAL
     (
       TYPE ORACLE_HIVE
       DEFAULT DIRECTORY DEFAULT_DIR
       ACCESS PARAMETERS ( com.oracle.bigdata.tablename=default.movieapp_log_avro )
     )
     REJECT LIMIT UNLIMITED;

Table created.

In this instance, the Oracle table name does not match the Hive table name. Therefore, an ACCESS
PARAMETER was specified that references the Hive table (default.movieapp_log_avro).
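If you want to confirm how the two tables were registered, the standard data dictionary shows the access
driver type and default directory for each external table owned by moviedemo:

SQL> SELECT table_name, type_name, default_directory_name
     FROM   user_external_tables;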

4. Query the mylogdata table using the following command. The query output will be similar to this:

SQL> SELECT custid, movieid, time FROM mylogdata;

    CUSTID    MOVIEID TIME
---------- ---------- --------------------
   1191532      59440 2012-10-01:01:30:27
   1106264          0 2012-10-01:01:32:04
   1106264         28 2012-10-01:01:35:19
   1061810       9529 2012-10-01:01:37:42
   1061810       9529 2012-10-01:01:46:31
   1061810          0 2012-10-01:01:51:36
   1061810        121 2012-10-01:01:56:42
   1135508        240 2012-10-01:01:58:00
   1135508        240 2012-10-01:02:04:15
   1135508       1092 2012-10-01:02:10:23
   1135508       4638 2012-10-01:02:10:54
   1135508       4638 2012-10-01:02:16:49
   1135508          0 2012-10-01:02:24:00
   1135508        240 2012-10-01:02:31:12
   1191532      59440 2012-10-01:03:11:35
   1191532          0 2012-10-01:03:15:29
   1191532      59440 2012-10-01:03:19:24

299748 rows selected.

Oracle Big Data SQL utilized the Avro InputFormat to query the data.
Now, to illustrate how Oracle Big Data SQL uses the Hive Metastore at query compilation time to determine
query execution parameters, you will change the definition of the Hive table movieapp_log_json. In Hive,
alter the table's LOCATION so that it points to a location containing only two records.
5. Open up a NEW terminal window and keep the SQL*Plus window open. Invoke the hive CLI, and then
change the location and query the table by executing the following three commands:
[oracle@bigdatalite bigdatalite]$ hive
15/09/08 01:59:11 WARN conf.HiveConf: DEPRECATED: Configuration property
hive.metastore.local no longer has any effect. Make sure to provide a valid
value for hive.metastore.uris if you are connecting to a remote metastore.
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.13.1-cdh5.3.0.jar!/hive-log4j.properties
hive> ALTER TABLE movieapp_log_json SET LOCATION
"hdfs://bigdatalite.localdomain:8020/user/oracle/moviework/two_recs";
OK
Time taken: 1.371 seconds


What was the original location of the data in the original definition?
____________________________________________________________________________________
hive> SELECT * FROM movieapp_log_json;
OK
1185972    NULL    NULL    2012-07-01:00:00:07    NULL    8    NULL    NULL    NULL
1354924    1948    9       2012-07-01:00:00:22    N       7    NULL    NULL    NULL
Time taken: 1.087 seconds, Fetched: 2 row(s)

The Hive table returns the file's only two records, which look something like the output above (your two
rows may show different data).
6. Return to the SQL*Plus window and execute the same query.
SQL> SELECT * FROM movieapp_log_json;
Oracle Big Data SQL queried the Hive Metastore and picked up the change in LOCATION without any
change to the definition of the Oracle table.
The Oracle table returns the same two rows (your two rows will be the same as those returned in Hive).

8. Finally, reset the Hive table and then confirm that there are more than two rows. Execute the following
commands at the hive prompt:
hive> ALTER TABLE movieapp_log_json SET LOCATION
"hdfs://bigdatalite.localdomain:8020/user/oracle/moviework/applog_json";
OK
Time taken: 0.275 seconds
hive> select * from movieapp_log_json limit 10;
OK
1185972    NULL     NULL   2012-07-01:00:00:07    NULL   8   NULL   NULL   NULL
1354924    1948     9      2012-07-01:00:00:22    N      7   NULL   NULL   NULL
1083711    NULL     NULL   2012-07-01:00:00:26    NULL   9   NULL   NULL   NULL
1234182    11547    6      2012-07-01:00:00:32    Y      7   NULL   NULL   NULL
1010220    11547    6      2012-07-01:00:00:42    Y      6   NULL   NULL   NULL
1143971    NULL     NULL   2012-07-01:00:00:43    NULL   8   NULL   NULL   NULL
1253676    NULL     NULL   2012-07-01:00:00:50    NULL   9   NULL   NULL   NULL
1351777    608      6      2012-07-01:00:01:03    N      7   NULL   NULL   NULL
1143971    NULL     NULL   2012-07-01:00:01:07    NULL   9   NULL   NULL   NULL
1363545    27205    9      2012-07-01:00:01:18    Y      7   NULL   NULL   NULL
Time taken: 0.204 seconds, Fetched: 10 row(s)

The query should return 10 rows.


Lab Summary
Organizations are looking for innovative ways to manage more data from more sources than ever before.
Although technologies like Hadoop and NoSQL offer specific ways of addressing big data problems, they can
introduce data silos that complicate the data access and analysis needed to generate critical insights. To
maximize the value from information and deliver on the promise of big data, companies need to evolve their data
management architecture into a big data management system that seamlessly integrates all types of data from a
variety of sources, including Hadoop, relational, and NoSQL.
Big Data SQL breaks down data silos to simplify information access and discovery. The offering allows customers
to run a single SQL query across Hadoop, NoSQL, and Oracle Database, minimizing data movement while
increasing performance and virtually eliminating data silos. Oracle Big Data SQL enables customers to gain a
competitive advantage by making it easier to uncover insights faster, while protecting data security and enforcing
governance. Furthermore, this approach enables organizations to leverage existing SQL investments in both
people skills and applications.
In this lab, you created Oracle tables over data in HDFS and over Hive tables - tables that can be joined with
existing Oracle Database tables using ordinary SQL. You also changed a Hive table definition and saw how the
data returned by Oracle Database changed accordingly. All in a matter of minutes without writing any Java!
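As a taste of what that enables, a query joining the Hive-backed table with a regular Oracle table might look
like the sketch below. The customer table and its columns are illustrative - adjust the names to the schema
in your environment (activity 11 is the purchase event seen earlier in the log):

SQL> SELECT c.cust_id, c.last_name, COUNT(*) AS purchases
     FROM   customer c
     JOIN   movieapp_log_json l ON l.custid = c.cust_id
     WHERE  l.activity = 11
     GROUP  BY c.cust_id, c.last_name
     ORDER  BY purchases DESC;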


Appendix
Start Cloudera Service
1. Log in as root
2. service cloudera-scm-server start
Start Cloudera Agent
1. Log in as root
2. service cloudera-scm-agent start
Various Utilities in BigDataLite VM
There are a number of utilities that can be accessed via the web browser on the BigDataLite VM.
