
IBM Software An IBM Proof of Technology

Accessing Hadoop Data Using Hive


Unit 2: Working with Hive DDL

© Copyright IBM Corporation, 2015


US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents
LAB 2 WORKING WITH HIVE DDL
    1.1 SETTING HDFS USER PERMISSIONS
    1.2 ACCESSING THE HIVE BEELINE CLI
    1.3 WORKING WITH DATABASES IN HIVE
    1.4 EXPLORING OUR SAMPLE DATASET
        1.4.1 FINDING THE SAMPLE DATA
        1.4.2 SAMPLE DATA DESCRIPTIONS
    1.5 TABLES IN HIVE
        1.5.1 MANAGED NON-PARTITIONED TABLES
        1.5.2 MANAGED PARTITIONED TABLES
        1.5.3 EXTERNAL TABLE
    1.6 SUMMARY


Lab 2 Working with Hive DDL


Before we can begin working with and analyzing data in Hive, we must first use Hive's data definition
language (DDL). Using Hive DDL will enable us to create databases, tables, partitions and more, which we
can later load with data that can be queried and manipulated.

After completing this hands-on lab, you will be able to:

• Create, Alter, and Drop Databases in Hive.


• Create Managed, External, and Partitioned Tables in Hive.
• Locate Databases and Tables in HDFS.

Allow 1 to 1.5 hours to complete this section of the lab.

This version of the lab was designed using the IBM BigInsights 4.0 Quick Start Edition. Throughout this
lab it is assumed that you will be using the following account login information:

Where used                Username    Password

VM image setup screen     root        password

Linux                     virtuser    password

Ambari                    admin       admin

If you are continuing this series of hands-on labs immediately after completing Accessing Hadoop Data
Using Hive Unit 1: Exploring Hive, you may move on to section 1.1 of this lab. Otherwise, please refer to
Accessing Hadoop Data Using Hive Unit 1: Exploring Hive Section 1.1 to get started. (All Hadoop
components should be running.)


1.1 Setting HDFS User Permissions


Before we continue, we will make an ownership change to one of the directories on HDFS that Hive
uses. This will allow us to continue the rest of the lab without having to worry about the details of Hadoop
permissions.

__1. Open a new Linux command console.

__2. We will execute a few commands:

$ su - root

Enter the root password when prompted. Then execute the following commands:

$ su - hdfs

$ hadoop fs -chmod -R 777 /user

Once you have run those commands, enter the exit command twice to return to the virtuser id.

$ exit

$ exit

Now you can continue to the next step of the lab.
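
Optionally, you can verify the permission change before moving on. A quick check from the Linux
console (the owner and date columns in your listing will vary by environment):

$ hadoop fs -ls -d /user

The permissions string at the start of the line should now read drwxrwxrwx.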


1.2 Accessing the Hive Beeline CLI


In this section we will navigate to the Hive Beeline CLI and start an interactive CLI session.

__1. In a Linux terminal, change to the Hive bin directory.

$ cd /usr/iop/4.0.0.0/hive/bin

__2. Start an interactive Hive shell session.

$ ./beeline

__3. Connect to Hive. The !connect command takes the JDBC URL, the user, the password, and the
JDBC driver class:

beeline> !connect jdbc:hive2://rvm.svl.ibm.com:10000 virtuser password org.apache.hive.jdbc.HiveDriver

__4. Run the SHOW DATABASES statement from within the interactive Hive session.

hive> SHOW DATABASES;


1.3 Working with Databases in Hive


If we neglect to create a new database in Hive, then the “default” database will be used. Let’s create a
new database and work with it. In this exercise we will create two databases in the Hive system. One of
them will be used for future exercises. The other will be deleted.

__1. In the Hive Beeline shell create a database called testDB.

hive> CREATE DATABASE testDB;
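
As an aside, CREATE DATABASE accepts a few optional clauses that we are not using here. A
sketch for reference (the comment and property values are made up for illustration):

hive> CREATE DATABASE IF NOT EXISTS testdb
      COMMENT 'Sandbox database for the Hive DDL labs'
      WITH DBPROPERTIES ('creator' = 'bigdatarockstar');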

__2. Let’s confirm that the new database was added to Hive’s catalog.

hive> SHOW DATABASES;

Notice that Hive converted “testDB” to lowercase.

__3. Now that we have created a new database, let’s describe it.

hive> DESCRIBE DATABASE testdb;


The DESCRIBE DATABASE command shows us the location of testdb on HDFS. Notice that the testdb.db
schema directory is stored inside HDFS in the /apps/hive/warehouse directory.

__4. Let’s confirm that the new testdb.db directory was in fact created on HDFS. Open a second
Linux terminal.

__5. Check HDFS to confirm our new database directory was created.

$ hadoop fs -ls /apps/hive/warehouse

The testdb.db directory WAS created.

Keep this second Linux console open for the rest of this lab. We will continue to use it to
look at HDFS. You will be going back and forth between the Hive console and this Linux
console.

__6. Go back to the Beeline console. Add some information to the DBPROPERTIES metadata for the
testdb database. We do this by using the ALTER DATABASE syntax.

hive> ALTER DATABASE testdb SET DBPROPERTIES ('creator' = 'bigdatarockstar');

__7. Let’s view the extended details of our testdb database.


hive> DESCRIBE DATABASE EXTENDED testdb;

Notice the updated database properties.

__8. Go ahead and delete the testdb database.

hive> DROP DATABASE testdb CASCADE;

Notice the CASCADE keyword; it is optional. Using it causes Hive to delete all the tables in
your database (if there are any) before dropping the database. If you try to drop a database
that still contains tables without the CASCADE keyword, Hive won't let you.
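
To summarize the variations (assuming a database named testdb):

hive> DROP DATABASE testdb;                    -- fails if testdb still contains tables
hive> DROP DATABASE testdb CASCADE;            -- drops any tables first, then the database
hive> DROP DATABASE IF EXISTS testdb CASCADE;  -- also suppresses the error if testdb does not exist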

__9. Confirm that testdb is no longer in the Hive metastore catalog.

hive> SHOW DATABASES;

__10. Now we are going to create a database that will house the tables we will use for many of the
exercises in this course. This database will be called “computersalesdb”.

hive> CREATE DATABASE computersalesdb;

__11. Verify the DB was created. Note the location of the database directory in HDFS.

hive> DESCRIBE DATABASE computersalesdb;


We can see that the computersalesdb database does in fact exist and that its new directory was
created on HDFS at /apps/hive/warehouse/computersalesdb.db.

__12. Tell Hive to use computersalesdb (we will use this database for the rest of this interactive
session).

hive> USE computersalesdb;

Keep your CLI open – we will be using it in the upcoming exercises.

1.4 Exploring Our Sample Dataset


Before we begin creating tables in our new database it is important to understand what data is in our
sample files and how that data is structured.

Earlier in this course you placed the lab data in the /home/virtuser directory. Now we will take a look at
the contents of those files.

1.4.1 Finding the Sample Data

__1. Click the virtuser's Home shortcut on the desktop.

A new window will open that shows us the contents of our /home/virtuser directory.


__2. Navigate to the following directory: sampleData -> Computer_Business

__3. In Hive there is no easy way to remove the header row from a file. To make things easy we
have two directories – WithRowHeaders and WithoutHeaders.

WithRowHeaders – Contains 3 data files in CSV format. The first row in each file is a header row.
We created this directory just so you can see the metadata of the table (what data each column
holds). You will only use this directory in this exercise – and only to examine what the data
looks like.

WithoutHeaders – Contains the same 3 data files as the WithRowHeaders directory, EXCEPT the
first rows (the header data) have been removed. This data is ready to be used with Hive.

__4. Let’s check out our data. Navigate to the WithRowHeaders directory.


__5. To view the contents of one of the files, right-click the file and then click “Open” from the menu.
Then click the “Display” button on the pop up box. Examine each of the 3 files.


1.4.2 Sample Data Descriptions

Our sample data is from a fictitious computer retailer. The company sells computer parts and
generally serves a single state in the country.

Customer.csv:

Purpose: Hold customer records.

Columns:

Column        Description

FNAME         Customer's first name
LNAME         Customer's last name
STATUS        Active or Inactive status
TELNO         Telephone number
CUSTOMER_ID   Customer's unique ID
CITY|ZIP      City and ZIP code, separated by the "|" character
Example of contents:
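
The screenshot of the file is not reproduced here, but a line in Customer.csv follows this shape
(the values below are invented for illustration, not taken from the actual file):

Mary,Smith,Active,5125551234,C0001,Springfield|11111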


Product.csv:

Purpose: Hold product records.

Columns:

Column          Description

PROD_NAME       Name of product
DESCRIPTION     Description of computer product
CATEGORY        Category the product belongs to
QTY_ON_HAND     Quantity of product in warehouse
PROD_NUM        Unique product number
PACKAGED_WITH   Colon-separated list of things that come in the package with the product

Example of contents:
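
Again, the screenshot is not reproduced; an invented line for illustration:

Speedy Mouse,Wireless optical mouse,Accessories,42,P0100,usbdongle:manual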


Sales.csv:

Purpose: Holds all historical sales records. The company updates it once a month.

Columns:

Column     Description

CUST_ID    ID of the customer who made the purchase
PROD_NUM   ID of the product that was purchased
QTY        Quantity purchased
DATE       Date of sale
SALES_ID   Unique sale ID

Example of contents:
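
An invented line for illustration (the date format in the actual file may differ):

C0001,P0100,2,2015-06-15,S0001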


1.5 Tables in Hive

1.5.1 Managed Non-Partitioned Tables

The first table we will create in Hive is the products table. This table will be fully managed by Hive and
will not contain any partitions.

__1. In the CLI, create the new products table in Hive.


hive> CREATE TABLE products
(
prod_name STRING,
description STRING,
category STRING,
qty_on_hand INT,
prod_num STRING,
packaged_with ARRAY<STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY ':'
STORED AS TEXTFILE;

Note the data types we have assigned to the different columns. The packaged_with column is of
special interest – it is designated as an Array of Strings. The array will hold data that is separated
by the colon “:” character - e.g. satacable:manual. We also tell Hive that the columns in our rows
are delimited by commas “,”. The last line tells Hive that our data file is a plain text file.
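
Once the table is loaded with data (in a later unit), the ARRAY column can be addressed by index
or inspected with built-in functions. A quick sketch, not part of the lab steps:

hive> SELECT prod_name, packaged_with[0], size(packaged_with) FROM products;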

__2. Ask Hive to show us the tables in our database.


hive> SHOW TABLES IN computersalesdb;


We can see that only one table exists in our database and it is the new products table we just
created.

__3. Add a note to the TBLPROPERTIES for our new products table.
hive> ALTER TABLE products SET TBLPROPERTIES (
'details' = 'This table holds products');

__4. List the extended details of the products table.


hive> DESCRIBE EXTENDED products;

Beeline’s default setting here makes the results hard to read. Let’s adjust Beeline to display data
in a different format so we can see all of the output.

Inside of Beeline:
hive> !set outputformat vertical


Now, rerun the DESCRIBE EXTENDED products; command.

That is a lot of details! Notice there is some interesting info including the location of this table
within HDFS: /apps/hive/warehouse/computersalesdb.db/products

__5. Let’s verify that the products directory was created on HDFS in the location listed above. Run the
HDFS ls command from within a Linux console. First list the contents of the database directory,
then list the contents of the products table directory.

$ hadoop fs -ls /apps/hive/warehouse/computersalesdb.db;

$ hadoop fs -ls /apps/hive/warehouse/computersalesdb.db/products;

The first command confirms that there is in fact a products table directory on HDFS. The second
command shows that there are no files within the products directory yet. This directory will be
empty until we load data into the products table in a later exercise.


__6. Imagine that our fictitious computer company adds sales data to a "sales_staging" table at the
end of each month. From this sales_staging table they then move the data they want to analyze
into a partitioned "sales" table. The partitioned sales table is the one they actually use for their
analysis.

Now that we know how to create tables, we will create one more managed non-partitioned table
called “sales_staging”. This table will hold ALL of the sales data from the sales.csv file. In later
exercises we will actually split this sales_staging data into a partitioned table called “sales”.

In the CLI, create the new sales_staging table in Hive.


hive> CREATE TABLE sales_staging
(
cust_id STRING,
prod_num STRING,
qty INT,
sale_date STRING,
sales_id STRING
)
COMMENT 'Staging table for sales data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

__7. We can now assume that the new sales_staging table directory is on HDFS in the following
folder: /apps/hive/warehouse/computersalesdb.db/sales_staging. Let’s quickly confirm by
entering the following command in the Linux console:

$ hadoop fs -ls /apps/hive/warehouse/computersalesdb.db;

Sure enough, the sales_staging directory was created and is now being managed by Hive.

__8. Ask Hive to show us the tables in our database. Confirm your new sales_staging table is in the
Hive catalog.


hive> SHOW TABLES;

__9. Let’s pretend that we have decided we want to update some of our column metadata. We will
change the sale_date column in the sales_staging table from a STRING type to a DATE type.
hive> ALTER TABLE sales_staging CHANGE sale_date sale_date DATE;
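
To confirm the change took effect, you can describe the table again (a quick sanity check, not an
official lab step):

hive> DESCRIBE sales_staging;

The sale_date column should now be listed with the date type.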

1.5.2 Managed Partitioned Tables

__1. Now we will create a partitioned table. This table will be a managed table – Hive will manage the
metadata and lifecycle of this table, just like the tables we previously created.

In the CLI create the sales table. This table will be partitioned on the sales date.
hive> CREATE TABLE sales
(
cust_id STRING,
prod_num STRING,
qty INT,
sales_id STRING
)
COMMENT 'Table for analysis of sales data'
PARTITIONED BY (sales_date STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

Notice that we list the sales_date in the PARTITIONED BY clause instead of listing it in the data
column metadata. Since we are partitioning on sales_date, Hive will keep track of the dates for
us outside of the actual data.

__2. Let’s view the extended details of our new table.


hive> DESCRIBE EXTENDED sales;

We can see a directory was created on HDFS at:


/apps/hive/warehouse/computersalesdb.db/sales.

When we later put data into this table, a new directory will be created inside the sales directory
for EACH partition.
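
For example, after data is loaded you might see directories like these under the sales table
directory (the partition values here are hypothetical; each corresponds to one distinct sales_date):

/apps/hive/warehouse/computersalesdb.db/sales/sales_date=2015-01-15
/apps/hive/warehouse/computersalesdb.db/sales/sales_date=2015-02-15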

The following line in the details shows us how our table is partitioned:
partitionKeys:[FieldSchema(name:sales_date, type:string, comment:null)]

1.5.3 External Table

Another department in our fictitious computer company would like to be able to analyze the customer
data. It therefore makes sense that we set up the customer table as EXTERNAL so they can use their
tools on the data and we can use ours (Hive). We will place a copy of the Customer.csv file in HDFS and
then create a new table in Hive that points to this data.

__1. First we need to create a new directory – let’s call it “shared_hive_data” - on HDFS that can
house our Customer.csv data file. Let’s put this in the /tmp directory. We will run the command to
make the new directory from the Linux console. Open the Linux console and enter the following:


$ hadoop fs -mkdir /tmp/shared_hive_data;

__2. Now we will move a copy of the Customer.csv file into the /tmp/shared_hive_data directory. We
can run the command to do this from within the Linux console.

$ hadoop fs -put
/home/virtuser/sampleData/Computer_Business/WithoutHeaders/Customer.csv
/tmp/shared_hive_data/Customer.csv;

__3. Confirm that Customer.csv has been copied successfully into HDFS. Enter the following
command in the Linux console.

$ hadoop fs -ls /tmp/shared_hive_data/;

You should see the Customer.csv file listed in the output.

If you’d like, run the “cat” command to verify the data is in the Customer.csv file on HDFS.

$ hadoop fs -cat /tmp/shared_hive_data/Customer.csv;

__4. Now we just need to define our external customer table. Go back to the Hive Beeline console
and create the table.

hive> CREATE EXTERNAL TABLE customer
(
fname STRING,
lname STRING,
status STRING,
telno STRING,
customer_id STRING,
city_zip STRUCT<city:STRING, zip:STRING>
)
COMMENT 'External table for customer data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
LOCATION '/tmp/shared_hive_data/';

There are a few things to note here. First, we use the EXTERNAL keyword in the CREATE line.
Second, we leave out the STORED AS clause when creating this external table, since the default
format is already TEXTFILE. Finally, we add a LOCATION '/path/to/data/directory' clause to the
end of the statement.

Hive expects that LOCATION will be a directory, not a file. For our exercises we will only have a
single file in the shared_hive_data directory. However, you could put multiple customer data files
into the shared_hive_data directory and Hive would use them all for your EXTERNAL table! That
is a common scenario.
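
For instance, if another extract of customer data arrived, you could drop it into the same directory
and the customer table would immediately include its rows – no DDL changes required. A sketch
(Customer2.csv and its local path are hypothetical):

$ hadoop fs -put /path/to/Customer2.csv /tmp/shared_hive_data/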

__5. Let’s view the extended details of our new table.


hive> DESCRIBE EXTENDED customer;

You can see that the location points to the /tmp/shared_hive_data directory we designated on
HDFS. Also notice towards the end of the output that tableType:EXTERNAL_TABLE.

__6. Let’s set the Beeline output format to table style. This will give us cleaner output when we run
queries.
hive> !set outputformat table


__7. Since customer is an External table and Hive already knows where the data is sitting, you can
already begin to run queries on this table. Reward yourself for completing this lab by running a
simple select query as proof.
hive> SELECT * FROM customer LIMIT 5;

This Hive query didn’t run any MapReduce jobs. Hive is able to read the data file and write the
results to the CLI without using MapReduce, since this is a simple SELECT and LIMIT
statement.
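
You can also address the fields of the STRUCT column by name using dot notation. A quick sketch
(a simple fetch like this typically avoids MapReduce as well, whereas an aggregate such as
COUNT(*) would normally launch a job):

hive> SELECT fname, lname, city_zip.city FROM customer LIMIT 5;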

1.6 Summary
Congratulations! You now know how to create, alter, and remove databases in Hive. You can create
managed, external, and partitioned Hive tables. You are also familiar with the sample data that will be
used in this course. You may move on to the next Unit.



© Copyright IBM Corporation 2015.

The information contained in these materials is provided for informational purposes only, and is
provided AS IS without warranty of any kind, express or implied. IBM shall not be responsible for any
damages arising out of the use of, or otherwise related to, these materials. Nothing contained in these
materials is intended to, nor shall have the effect of, creating any warranties or representations from
IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license
agreement governing the use of IBM software. References in these materials to IBM products,
programs, or services do not imply that they will be available in all countries in which IBM operates.
This information is based on current IBM product plans and strategy, which are subject to change by
IBM without notice. Product release dates and/or capabilities referenced in these materials may change
at any time at IBM's sole discretion based on market opportunities or other factors, and are not
intended to be a commitment to future product or feature availability in any way.

IBM, the IBM logo and ibm.com are trademarks of International Business Machines Corp., registered in
many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other
companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark
information" at www.ibm.com/legal/copytrade.shtml.
