Contents
LAB 2 WORKING WITH HIVE DDL............................................................................................................................... 4
1.1 SETTING HDFS USER PERMISSIONS ......................................................................................................... 5
1.2 ACCESSING THE HIVE BEELINE CLI ........................................................................................................... 6
1.3 WORKING WITH DATABASES IN HIVE .......................................................................................................... 7
1.4 EXPLORING OUR SAMPLE DATASET ......................................................................................................... 10
1.4.1 FINDING THE SAMPLE DATA ...................................................................................................... 10
1.4.2 SAMPLE DATA DESCRIPTIONS ................................................................................................... 13
1.5 TABLES IN HIVE ..................................................................................................................................... 16
1.5.1 MANAGED NON-PARTITIONED TABLES ....................................................................................... 16
1.5.2 MANAGED PARTITIONED TABLES ............................................................................................... 20
1.5.3 EXTERNAL TABLE .................................................................................................................... 21
1.6 SUMMARY ............................................................................................................................................. 24
IBM Software
This version of the lab was designed using the IBM BigInsights 4.0 Quick Start Edition. Throughout this
lab it is assumed that you will be using the following account login information:
Username Password
If you are continuing this series of hands-on labs immediately after completing Accessing Hadoop Data
Using Hive Unit 1: Exploring Hive, you may move on to section 1.1 of this lab. Otherwise, please refer to
Accessing Hadoop Data Using Hive Unit 1: Exploring Hive, Section 1.1, to get started. (All Hadoop
components should be running.)
$ su - root
Enter the root password when prompted. Then execute the following commands:
$ su - hdfs
Once you have run those commands, enter the exit command twice to return to the virtuser ID.
$ exit
$ exit
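The permission-setting commands themselves are not reproduced in this text. As a sketch only - the exact paths and modes used in the original lab are assumptions - BigInsights labs of this era typically opened up the HDFS /tmp and /user areas so the virtuser account could write to them:

```shell
# Hypothetical examples only - run as the hdfs superuser.
# Open up the shared /tmp area and give virtuser a home directory.
hadoop fs -chmod -R 777 /tmp
hadoop fs -mkdir -p /user/virtuser
hadoop fs -chown virtuser:virtuser /user/virtuser
```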
$ cd /usr/iop/4.0.0.0/hive/bin
$ ./beeline
__4. Run the SHOW DATABASES statement from within the interactive Hive session.
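In Beeline the statement looks like this (note the trailing semicolon, which Beeline requires):

```sql
hive> SHOW DATABASES;
```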
__2. Let’s confirm that the new database was added to Hive’s catalog.
__3. Now that we have created a new database, let’s describe it.
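Assuming the new database is named testdb, as the output discussed below indicates, the statement is:

```sql
hive> DESCRIBE DATABASE testdb;
```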
The DESCRIBE DATABASE output shows us the location of testdb on HDFS. Notice that the testdb.db
schema is stored inside HDFS in the /apps/hive/warehouse directory.
__4. Let’s confirm that the new testdb.db directory was in fact created on HDFS. Open a second
Linux terminal.
__5. Check HDFS to confirm our new database directory was created.
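Given the warehouse path shown by DESCRIBE DATABASE, the check looks like this:

```shell
$ hadoop fs -ls /apps/hive/warehouse
```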
Keep this second Linux console open for the rest of this lab. We will continue to use it to
look at HDFS. You will be going back and forth between the Hive console and this Linux
console.
__6. Go back to the Beeline console. Add some information to the DBPROPERTIES metadata for the
testdb database. We do this by using the ALTER DATABASE syntax.
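A sketch of the syntax - the property key and value here are illustrative, not necessarily the exact ones used in the original lab:

```sql
hive> ALTER DATABASE testdb SET DBPROPERTIES (
  'edited-by' = 'lab-user');
```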
Notice the CASCADE keyword, which is optional. Using it causes Hive to delete all the tables in
your database (if there are any) before dropping the database. If you try to delete a database
that contains tables without the CASCADE keyword, Hive won't let you.
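For example, to drop the testdb database created earlier along with any tables it contains:

```sql
hive> DROP DATABASE testdb CASCADE;
```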
__10. Now we are going to create a database that will house the tables we will use for many of the
exercises in this course. This database will be called “computersalesdb”.
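The statement, using the database name given above:

```sql
hive> CREATE DATABASE computersalesdb;
```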
__11. Verify the DB was created. Note the location of the database directory in HDFS.
We can see that the computersalesdb does in fact exist and the new directory was created on
HDFS in the /apps/hive/warehouse/computersalesdb.db directory.
__12. Tell Hive to use the computersalesdb database (we will use this database for the rest of this
interactive session).
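The statement is:

```sql
hive> USE computersalesdb;
```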
Earlier in this course you placed the lab data in the /home/virtuser directory. Now we will take a look at
the contents of those files.
A new window will open that shows us the contents of our /home/virtuser directory.
__3. In Hive there is no easy way to remove the header row from a file. To make things easy, we
have two directories - WithoutHeaders and WithRowHeaders.
WithRowHeaders - Contains 3 data files in CSV format. The first row in each file is a header row.
We created this directory just so you can see the metadata of the table (what data each column
holds). You will only be using this directory in this exercise – and only to examine what the data
looks like.
WithoutHeaders – Contains the same 3 data files that the WithRowHeaders directory has,
EXCEPT the first rows (header data) are removed from this data. This data is ready to be used
with Hive.
__4. Let’s check out our data. Navigate to the WithRowHeaders directory.
__5. To view the contents of one of the files, right-click the file and then click “Open” from the menu.
Then click the “Display” button on the pop up box. Examine each of the 3 files.
Our sample data is from a fictitious computer retailer. The company sells computer parts and
generally serves a single state in the country.
Customer.csv:
Columns:
Example of contents:
Product.csv:
Columns:
- Name of product
- Description of computer product
- Category product belongs to
- Quantity of product in warehouse
- Unique product number
- Colon-separated list of things that come in the package with the product
Example of contents:
Sales.csv:
Purpose: Holds all historical sales records. The company updates it once a month.
Columns:
- ID of customer who made purchase
- ID of product that was purchased
- QTY purchased
- Date of sale
- Unique sale ID
Example of contents:
The first table we will create in Hive is the products table. This table will be fully managed by Hive and
will not contain any partitions.
Note the data types we have assigned to the different columns. The packaged_with column is of
special interest – it is designated as an Array of Strings. The array will hold data that is separated
by the colon “:” character - e.g. satacable:manual. We also tell Hive that the columns in our rows
are delimited by commas “,”. The last line tells Hive that our data file is a plain text file.
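The full CREATE TABLE statement is not reproduced in this text. The sketch below is consistent with the description - the ARRAY<STRING> type, the ':' collection delimiter, the ',' field delimiter, and the text-file format come from the paragraph above, but the column names themselves are assumptions based on the Product.csv column descriptions:

```sql
-- Column names are assumed from the Product.csv description; the
-- delimiters and storage format are as stated in the lab text.
hive> CREATE TABLE products
(
  prod_name STRING,
  description STRING,
  category STRING,
  qty_on_hand INT,
  prod_num STRING,
  packaged_with ARRAY<STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY ':'
STORED AS TEXTFILE;
```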
We can see that only one table exists in our database and it is the new products table we just
created.
__3. Add a note to the TBLPROPERTIES for our new products table.
hive> ALTER TABLE products SET TBLPROPERTIES (
'details' = 'This table holds products');
Beeline’s default setting here makes the results hard to read. Let’s adjust Beeline to display data
in a different format so we can see all of the output.
Inside of Beeline:
hive> !set outputformat vertical
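The statement that produces the detailed output discussed next is most likely a DESCRIBE EXTENDED (an assumption; DESCRIBE FORMATTED would show similar detail in a friendlier layout):

```sql
hive> DESCRIBE EXTENDED products;
```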
That is a lot of details! Notice there is some interesting info including the location of this table
within HDFS: /apps/hive/warehouse/computersalesdb.db/products
__5. Let’s verify that the products directory was created on HDFS in the location listed above. Run the
HDFS ls command from within a Linux console. First list the contents of the database directory,
then list the contents of the products table directory.
The first command confirms that there is in fact a products table directory on HDFS. The second
command shows that there are no files within the products directory yet. This directory will be
empty until we load data into the products table in a later exercise.
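Using the table location shown above, the two checks look like this:

```shell
$ hadoop fs -ls /apps/hive/warehouse/computersalesdb.db
$ hadoop fs -ls /apps/hive/warehouse/computersalesdb.db/products
```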
__6. Imagine that our fictitious computer company adds sales data to a “sales_staging” table at the
end of each month. From this sales_staging table they then move the data they want to analyze
into a partitioned “sales” table. The partitioned sales table is the one they actually use for their
analysis.
Now that we know how to create tables, we will create one more managed non-partitioned table
called “sales_staging”. This table will hold ALL of the sales data from the sales.csv file. In later
exercises we will actually split this sales_staging data into a partitioned table called “sales”.
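A sketch of the statement, assuming the same column layout as the Sales.csv description and the partitioned sales table created later in this lab (sale_date is kept as a regular STRING column here, as the later ALTER TABLE step indicates):

```sql
-- Columns assumed from the Sales.csv description; sale_date stays
-- STRING so it can later be changed with ALTER TABLE ... CHANGE.
hive> CREATE TABLE sales_staging
(
  cust_id STRING,
  prod_num STRING,
  qty INT,
  sale_date STRING,
  sales_id STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```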
__7. We can now assume that the new sales_staging table directory is on HDFS in the following
folder: /apps/hive/warehouse/computersalesdb.db/sales_staging. Let’s quickly confirm by
entering the following command in the Linux console:
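Using the folder path given above:

```shell
$ hadoop fs -ls /apps/hive/warehouse/computersalesdb.db/sales_staging
```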
Sure enough, the sales_staging directory was created and is now being managed by Hive.
__8. Ask Hive to show us the tables in our database. Confirm your new sales_staging table is in the
Hive catalog.
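The statement is:

```sql
hive> SHOW TABLES;
```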
__9. Let’s pretend that we have decided we want to update some of our column metadata. We will
change the sale_date column in the sales_staging table from a STRING type to a DATE type.
hive> ALTER TABLE sales_staging CHANGE sale_date sale_date DATE;
__1. Now we will create a partitioned table. This table will be a managed table – Hive will manage the
metadata and lifecycle of this table, just like the tables we previously created.
In the CLI create the sales table. This table will be partitioned on the sales date.
hive> CREATE TABLE sales
(
cust_id STRING,
prod_num STRING,
qty INT,
sales_id STRING
)
COMMENT 'Table for analysis of sales data'
PARTITIONED BY (sales_date STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Notice that we list the sales_date in the PARTITIONED BY clause instead of listing it in the data
column metadata. Since we are partitioning on sales_date, Hive will keep track of the dates for
us outside of the actual data.
When we later put data into this table, a new directory will be created inside the sales directory
for EACH partition.
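For example, loading rows for two different sale dates would produce a directory layout like this (the dates shown are illustrative):

```text
/apps/hive/warehouse/computersalesdb.db/sales/sales_date=2015-01-01/
/apps/hive/warehouse/computersalesdb.db/sales/sales_date=2015-02-01/
```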
The following line in the details shows us how our table is partitioned: partitionKeys:
[FieldSchema(name:sales_date, type:string, comment:null)]
Another department in our fictitious computer company would like to be able to analyze the customer
data. It therefore makes sense to set up the customer table as EXTERNAL so they can use their
tools on the data and we can use ours (Hive). We will place a copy of the Customer.csv file in HDFS and
then create a new table in Hive that points to this data.
__1. First we need to create a new directory – let’s call it “shared_hive_data” - on HDFS that can
house our Customer.csv data file. Let’s put this in the /tmp directory. We will run the command to
make the new directory from the Linux console. Open the Linux console and enter the following:
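Using the directory name and location given above:

```shell
$ hadoop fs -mkdir /tmp/shared_hive_data
```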
__2. Now we will move a copy of the Customer.csv file into the /tmp/shared_hive_data directory. We
can run the command to do this from within the Linux console.
$ hadoop fs -put
/home/virtuser/sampleData/Computer_Business/WithoutHeaders/Customer.csv
/tmp/shared_hive_data/Customer.csv;
__3. Confirm that Customer.csv has been copied successfully into HDFS. Enter the following
command in the Linux console.
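The listing (and the optional cat check mentioned below) look like this:

```shell
$ hadoop fs -ls /tmp/shared_hive_data
$ hadoop fs -cat /tmp/shared_hive_data/Customer.csv
```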
If your output looks similar to the screen capture above, then that is good!
If you’d like, run the “cat” command to verify the data is in the Customer.csv file on HDFS.
__4. Now we just need to define our external customer table. Go back to the Hive Beeline console
and create the table.
hive> CREATE EXTERNAL TABLE customer
(
customer_id STRING,
city_zip STRUCT<city:STRING, zip:STRING>
)
COMMENT 'External table for customer data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
LOCATION '/tmp/shared_hive_data/';
There are a few things to note here. First, we use the EXTERNAL keyword in the CREATE line.
We leave out the STORED AS line when creating this external table, since the default
format is already TEXTFILE. We also add the LOCATION ‘location/of/datadirectory’ line to
the end of the statement.
Hive expects that LOCATION will be a directory, not a file. For our exercises we will only have a
single file in the shared_hive_data directory. However, you could put multiple customer data files
into the shared_hive_data directory and Hive would use them all for your EXTERNAL table! That
is a common scenario.
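The output discussed next likely comes from a describe statement such as (an assumption; the exact step was not reproduced in this text):

```sql
hive> DESCRIBE EXTENDED customer;
```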
You can see that the location points to the /tmp/shared_hive_data directory we designated on
HDFS. Also notice, towards the end of the output, that the tableType is EXTERNAL_TABLE.
__6. Let’s set the Beeline output format to table style. This will give us cleaner output when we run
queries.
hive> !set outputformat table
__7. Since customer is an External table and Hive already knows where the data is sitting, you can
already begin to run queries on this table. Reward yourself for completing this lab by running a
simple select query as proof.
hive> SELECT * FROM customer LIMIT 5;
This Hive query didn’t run any MapReduce jobs. Hive is able to read the data file and write the
results to the CLI without using MapReduce, since this is a simple SELECT and LIMIT
statement.
1.6 Summary
Congratulations! You now know how to create, alter and remove databases in Hive. You can create
managed, external, and partitioned Hive tables. You also are familiar with the sample data that will be
used in this course. You may move on to the next Unit.