
Apache Hive

The Apache Hive™ data warehouse software facilitates querying and managing large datasets residing in
distributed storage. Built on top of Apache Hadoop™, it provides

 Tools to enable easy data extract/transform/load (ETL)
 A mechanism to impose structure on a variety of data formats
 Access to files stored either directly in Apache HDFS™ or in other data storage systems such
as Apache HBase™
 Query execution via MapReduce

Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query
the data. At the same time, this language also allows programmers who are familiar with the MapReduce
framework to plug in their custom mappers and reducers to perform more sophisticated analysis that may
not be supported by the built-in capabilities of the language. QL can also be extended with custom scalar
functions (UDFs), aggregations (UDAFs), and table functions (UDTFs).
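
As a quick illustration of the extension mechanism (the jar path, function name, class, and table used here
are hypothetical, not part of this page), a custom UDF is typically registered from the CLI before it is used
in a query:

add jar /path/to/my_udfs.jar;
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.MyLower';
SELECT my_lower(some_column) FROM some_table;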

Hive does not require data to be read or written in a special "Hive format"; there is no such thing. Hive works
equally well on Thrift, control-delimited, or your own specialized data formats. Please see File
Format and SerDe in the Developer Guide for details.

Hive is not designed for OLTP workloads and does not offer real-time queries or row-level updates. It is
best used for batch jobs over large sets of append-only data (like web logs). What Hive values most are
scalability (scale out with more machines added dynamically to the Hadoop cluster), extensibility (with
MapReduce framework and UDF/UDAF/UDTF), fault-tolerance, and loose-coupling with its input formats.

General Information about Hive


 Getting Started
 Presentations and Papers about Hive
 A List of Sites and Applications Powered by Hive
 FAQ
 hive-users Mailing List
 Hive IRC Channel: #hive on irc.freenode.net
 About This Wiki

User Documentation
 Hive Tutorial
 HiveQL Language Manual (Queries, DML, DDL, and CLI)
 Hive Operators and Functions
 Hive Web Interface
 Hive Client (JDBC, ODBC, Thrift, etc)
 HiveServer2 Client
 Hive Change Log
 Avro SerDe

Administrator Documentation
 Installing Hive
 Configuring Hive
 Setting Up Metastore
 Setting Up Hive Web Interface
 Setting Up Hive Server (JDBC, ODBC, Thrift, etc.)
 Hive on Amazon Web Services
 Hive on Amazon Elastic MapReduce

Resources for Contributors


 Hive Developer FAQ
 How to Contribute
 Hive Contributors Meetings
 Hive Developer Guide
 Plugin Developer Kit
 Unit Test Parallel Execution
 Hive Performance
 Hive Architecture Overview
 Hive Design Docs
 Roadmap/Call to Add More Features
 Full-Text Search over All Hive Resources
 Becoming a Committer
 How to Commit
 How to Release
 Build Status on Jenkins (Formerly Hudson)
 Project Bylaws

For more information, please see the official Hive website.

Apache Hive, Apache Hadoop, Apache HBase, Apache HDFS, Apache, the Apache feather logo, and the
Apache Hive project logo are trademarks of The Apache Software Foundation.

Table of Contents

 Installation and Configuration


 Requirements
 Installing Hive from a Stable Release
 Building Hive from Source
 Compile hive on hadoop 23
 Running Hive
 Configuration management overview
 Runtime configuration
 Hive, Map-Reduce and Local-Mode
 Error Logs
 DDL Operations
 Metadata Store
 DML Operations
 SQL Operations
 Example Queries
 SELECTS and FILTERS
 GROUP BY
 JOIN
 MULTITABLE INSERT
 STREAMING
 Simple Example Use Cases
 MovieLens User Ratings
 Apache Weblog Data

DISCLAIMER: Hive has only been tested on Unix (Linux) and Mac systems using Java 1.6 for now,
although it may very well work on other similar platforms. It does not work on Cygwin.
Most of our testing has been on Hadoop 0.20, so we advise running it against this version even though it
may compile/work against other versions.

Installation and Configuration

Requirements
 Java 1.6
 Hadoop 0.20.x.
Installing Hive from a Stable Release
Start by downloading the most recent stable release of Hive from one of the Apache download mirrors
(see Hive Releases).

Next you need to unpack the tarball. This will result in the creation of a subdirectory named hive-x.y.z:
$ tar -xzvf hive-x.y.z.tar.gz

Set the environment variable HIVE_HOME to point to the installation directory:


$ cd hive-x.y.z
$ export HIVE_HOME=$(pwd)

Finally, add $HIVE_HOME/bin to your PATH:


$ export PATH=$HIVE_HOME/bin:$PATH

Building Hive from Source


The Hive SVN repository is located here: http://svn.apache.org/repos/asf/hive/trunk
$ svn co http://svn.apache.org/repos/asf/hive/trunk hive
$ cd hive
$ ant clean package
$ cd build/dist
$ ls
README.txt
bin/ (all the shell scripts)
lib/ (required jar files)
conf/ (configuration files)
examples/ (sample input and query files)

In the rest of the page, we use build/dist and <install-dir> interchangeably.

Compile hive on hadoop 23


$ svn co http://svn.apache.org/repos/asf/hive/trunk hive
$ cd hive
$ ant clean package -Dhadoop.version=0.23.3 -Dhadoop-0.23.version=0.23.3 -Dhadoop.mr.rev=23
$ ant clean package -Dhadoop.version=2.0.0-alpha -Dhadoop-0.23.version=2.0.0-alpha -Dhadoop.mr.rev=23

Running Hive
Hive uses Hadoop, which means:

 you must have Hadoop in your path, OR
 export HADOOP_HOME=<hadoop-install-dir>

In addition, you must create /tmp and /user/hive/warehouse
(a.k.a. hive.metastore.warehouse.dir) and set them chmod g+w in
HDFS before a table can be created in Hive.

Commands to perform this setup:
$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse

I also find it useful but not necessary to set HIVE_HOME


$ export HIVE_HOME=<hive-install-dir>

To use the Hive command line interface (CLI) from the shell:


$ $HIVE_HOME/bin/hive

Configuration management overview

 Hive default configuration is stored in <install-dir>/conf/hive-default.xml
 Configuration variables can be changed by (re-)defining them in <install-dir>/conf/hive-site.xml
 The location of the Hive configuration directory can be changed by setting
the HIVE_CONF_DIR environment variable.
 Log4j configuration is stored in <install-dir>/conf/hive-log4j.properties
 Hive configuration is an overlay on top of Hadoop, meaning the Hadoop configuration variables
are inherited by default.
 Hive configuration can be manipulated by:
 Editing hive-site.xml and defining any desired variables (including Hadoop variables) in it
 Using the set command from the CLI (see below)
 Invoking Hive with the syntax:
 $ bin/hive -hiveconf x1=y1 -hiveconf x2=y2
which sets the variables x1 and x2 to y1 and y2 respectively
 Setting the HIVE_OPTS environment variable to "-hiveconf x1=y1 -hiveconf x2=y2", which does
the same as above

Runtime configuration
 Hive queries are executed as map-reduce jobs and, therefore, the behavior
of such queries can be controlled by the Hadoop configuration variables.
 The CLI command 'SET' can be used to set any Hadoop (or Hive) configuration variable. For
example:
hive> SET mapred.job.tracker=myhost.mycompany.com:50030;
hive> SET -v;

The latter shows all the current settings. Without the -v option, only the
variables that differ from the base Hadoop configuration are displayed.

Hive, Map-Reduce and Local-Mode


The Hive compiler generates map-reduce jobs for most queries. These jobs are then submitted to the Map-
Reduce cluster indicated by the variable:
mapred.job.tracker
While this usually points to a map-reduce cluster with multiple nodes, Hadoop also offers a nifty option to
run map-reduce jobs locally on the user's workstation. This can be very useful to run queries over small
data sets - in such cases local mode execution is usually significantly faster than submitting jobs to a
large cluster. Data is accessed transparently from HDFS. Conversely, local mode runs with only one
reducer and can be very slow when processing larger data sets.

Starting v-0.7, Hive fully supports local mode execution. To enable this, the user can enable the following
option:
hive> SET mapred.job.tracker=local;

In addition, mapred.local.dir should point to a path that's valid on the local machine (for
example /tmp/<username>/mapred/local). (Otherwise, the user will get an exception allocating
local disk space).

Starting v-0.7, Hive also supports a mode to run map-reduce jobs in local-mode automatically. The
relevant options are:
hive> SET hive.exec.mode.local.auto=false;

Note that this feature is disabled by default. If enabled, Hive analyzes the size of each map-reduce job in
a query and may run it locally if the following thresholds are satisfied:

 The total input size of the job is lower than hive.exec.mode.local.auto.inputbytes.max (128MB by default)
 The total number of map tasks is less than hive.exec.mode.local.auto.tasks.max (4 by default)
 The total number of reduce tasks required is 1 or 0.

So for queries over small data sets, or for queries with multiple map-reduce jobs where the input to
subsequent jobs is substantially smaller (because of reduction/filtering in the prior job), jobs may be run
locally.
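
For example, a minimal sketch of turning the feature on for the current session while keeping the thresholds
at the default values quoted above:

hive> SET hive.exec.mode.local.auto=true;
hive> SET hive.exec.mode.local.auto.inputbytes.max=134217728;
hive> SET hive.exec.mode.local.auto.tasks.max=4;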

Note that there may be differences in the runtime environment of hadoop server nodes and the machine
running the hive client (because of different jvm versions or different software libraries). This can cause
unexpected behavior/errors while running in local mode. Also note that local mode execution is done in a
separate, child jvm (of the hive client). If the user so wishes, the maximum amount of memory for this
child jvm can be controlled via the option hive.mapred.local.mem. By default, it's set to zero, in which
case Hive lets Hadoop determine the default memory limits of the child jvm.

Error Logs
Hive uses log4j for logging. By default logs are not emitted to the
console by the CLI. The default logging level is WARN and the logs are stored in the folder:

 /tmp/<user.name>/hive.log

If the user wishes - the logs can be emitted to the console by adding
the arguments shown below:

 bin/hive -hiveconf hive.root.logger=INFO,console

Alternatively, the user can change the logging level only by using:

 bin/hive -hiveconf hive.root.logger=INFO,DRFA


Note that setting hive.root.logger via the 'set' command does not
change logging properties since they are determined at initialization time.

Hive also stores query logs on a per-Hive-session basis in /tmp/<user.name>/, but this can be configured
in hive-site.xml with the hive.querylog.location property.

Logging during Hive execution on a Hadoop cluster is controlled by Hadoop configuration. Usually
Hadoop will produce one log file per map and reduce task stored on the cluster machine(s) where the
task was executed. The log files can be obtained by clicking through to the Task Details page from the
Hadoop JobTracker Web UI.

When using local mode (using mapred.job.tracker=local), Hadoop/Hive execution logs are
produced on the client machine itself. Starting v-0.6 - Hive uses the hive-exec-
log4j.properties (falling back to hive-log4j.properties only if it's missing) to determine where
these logs are delivered by default. The default configuration file produces one log file per query executed
in local mode and stores it under /tmp/<user.name>. The intent of providing a separate configuration
file is to enable administrators to centralize execution log capture if desired (on a NFS file server for
example). Execution logs are invaluable for debugging run-time errors.

Error logs are very useful to debug problems. Please send them with any bugs (of which there are many!)
to hive-dev@hadoop.apache.org.

DDL Operations
Creating Hive tables and browsing through them
hive> CREATE TABLE pokes (foo INT, bar STRING);

Creates a table called pokes with two columns, the first being an integer and the other a string
hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);

Creates a table called invites with two columns and a partition column
called ds. The partition column is a virtual column. It is not part
of the data itself but is derived from the partition that a
particular dataset is loaded into.

By default, tables are assumed to be of text input format and the
delimiters are assumed to be ^A (ctrl-a).
hive> SHOW TABLES;

lists all the tables


hive> SHOW TABLES '.*s';

lists all the tables that end with 's'. The pattern matching follows Java regular
expressions. Check out this link for
documentation http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html
hive> DESCRIBE invites;

shows the list of columns

As for altering tables, table names can be changed and additional columns can be added:
hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
hive> ALTER TABLE events RENAME TO 3koobecaf;

Dropping tables:
hive> DROP TABLE pokes;

Metadata Store
Metadata is in an embedded Derby database whose disk storage location is determined by the
hive configuration variable named javax.jdo.option.ConnectionURL. By default
(see conf/hive-default.xml), this location is ./metastore_db

Right now, in the default configuration, this metadata can only be seen by
one user at a time.

The metastore can be stored in any database that is supported by JPOX. The
location and the type of the RDBMS can be controlled by the two variables
javax.jdo.option.ConnectionURL and javax.jdo.option.ConnectionDriverName.
Refer to JDO (or JPOX) documentation for more details on supported databases.
The database schema is defined in JDO metadata annotations file package.jdo
at src/contrib/hive/metastore/src/model.

In the future, the metastore itself can be a standalone server.

If you want to run the metastore as a network server so it can be accessed
from multiple nodes, try HiveDerbyServerMode.

DML Operations
Loading data from flat files into Hive:
hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;

Loads a file that contains two columns separated by ctrl-a into the pokes table.
'local' signifies that the input file is on the local file system. If 'local'
is omitted then it looks for the file in HDFS.

The keyword 'overwrite' signifies that existing data in the table is deleted.
If the 'overwrite' keyword is omitted, data files are appended to existing data sets.

NOTES:

 NO verification of data against the schema is performed by the load command.
 If the file is in HDFS, it is moved into the Hive-controlled file system namespace.
The root of the Hive directory is specified by the option hive.metastore.warehouse.dir
in hive-default.xml. We advise users to create this directory before
trying to create tables via Hive.

hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');

The two LOAD statements above load data into two different partitions of the table
invites. Table invites must be created as partitioned by the key ds for this to succeed.

hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');

The above command will load data from an HDFS file/directory to the table.
Note that loading data from HDFS will result in moving the file/directory. As a result, the operation is
almost instantaneous.

SQL Operations

Example Queries
Some example queries are shown below. They are available in build/dist/examples/queries.
More are available in the hive sources at ql/src/test/queries/positive

SELECTS and FILTERS


hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';

selects column 'foo' from all rows of partition ds=2008-08-15 of the invites table. The results are not
stored anywhere, but are displayed on the console.

Note that in all the examples that follow, INSERT (into a hive table, local
directory or HDFS directory) is optional.
hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';

selects all rows from partition ds=2008-08-15 of the invites table into an HDFS directory. The result
data is in files (depending on the number of mappers) in that directory.
NOTE: partition columns, if any, are selected by the use of *. They can also
be specified in the projection clauses.
Partitioned tables must always have a partition selected in the WHERE clause of the statement.
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;

selects all rows from the pokes table into a local directory.


hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' SELECT a.invites, a.pokes FROM profiles a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(*) FROM invites a WHERE a.ds='2008-08-15';
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;

Sum of a column. avg, min, max can also be used. Note that for versions of Hive which don't
include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).

GROUP BY
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(*) WHERE a.foo > 0 GROUP BY a.bar;
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;

Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place
of COUNT(*).

JOIN
hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;

MULTITABLE INSERT
FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;

STREAMING
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';

This streams the data in the map phase through the script /bin/cat (like Hadoop streaming).
Similarly, streaming can be used on the reduce side (please see the Hive Tutorial for examples).
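
As a rough sketch of reduce-side streaming (illustrative only, not taken from this guide), the map output can
be distributed to reducers with CLUSTER BY and passed through a second script; /bin/cat is again used as
a trivial placeholder script:

hive> FROM (
        FROM invites a
        SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat'
        CLUSTER BY oof
      ) mapped
      INSERT OVERWRITE TABLE events
      SELECT TRANSFORM(mapped.oof, mapped.rab) AS (oof, rab) USING '/bin/cat';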

Simple Example Use Cases

MovieLens User Ratings


First, create a table with tab-delimited text file format:
CREATE TABLE u_data (
userid INT,
movieid INT,
rating INT,
unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

Then, download and extract the data files:


wget http://www.grouplens.org/system/files/ml-data.tar+0.gz
tar xvzf ml-data.tar+0.gz

And load it into the table that was just created:


LOAD DATA LOCAL INPATH 'ml-data/u.data'
OVERWRITE INTO TABLE u_data;

Count the number of rows in table u_data:


SELECT COUNT(*) FROM u_data;

Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of
COUNT(*).
Now we can do some complex data analysis on the table u_data:

Create weekday_mapper.py:
import sys
import datetime

for line in sys.stdin:
  line = line.strip()
  userid, movieid, rating, unixtime = line.split('\t')
  weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
  print '\t'.join([userid, movieid, rating, str(weekday)])

Use the mapper script:


CREATE TABLE u_data_new (
userid INT,
movieid INT,
rating INT,
weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

add FILE weekday_mapper.py;

INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;

SELECT weekday, COUNT(*)
FROM u_data_new
GROUP BY weekday;

Note that if you're using Hive 0.5.0 or earlier you will need to use COUNT(1) in place of COUNT(*).
Apache Weblog Data
The format of Apache weblogs is customizable, but most webmasters use the default.
For the default Apache weblog format, we can create a table with the following commands.

More about RegexSerDe can be found here: http://issues.apache.org/jira/browse/HIVE-662


add jar ../build/contrib/hive_contrib.jar;

CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;
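
Once the table is defined, ordinary HiveQL applies. As an illustrative example (not part of the original page),
a simple breakdown of requests by HTTP status code:

SELECT status, COUNT(*)
FROM apachelog
GROUP BY status;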

Hive User FAQ


 Hive User FAQ
 I see errors like: Server access Error: Connection timed out url=
http://archive.apache.org/dist/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz
 How to change the warehouse.dir location for older tables?
 When running a JOIN query, I see out-of-memory errors.
 I am using MySQL as metastore and I see errors:
"com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure"
 Does Hive support Unicode?
 HiveQL
 Are HiveQL identifiers (e.g. table names, column names, etc) case sensitive?
 What are the maximum allowed lengths for HiveQL identifiers?
 Importing Data into Hive
 How do I import XML data into Hive?
 How do I import CSV data into Hive?
 How do I import JSON data into Hive?
 How do I import Thrift data into Hive?
 How do I import Avro data into Hive?
 How do I import delimited text data into Hive?
 How do I import fixed-width data into Hive?
 How do I import ASCII logfiles (HTTP, etc) into Hive?
 Exporting Data from Hive
 Hive Data Model
 What is the difference between a native table and an external table?
 What are dynamic partitions?
 Can a Hive table contain data in more than one format?
 Is it possible to set the data format on a per-partition basis?
 JDBC Driver
 Does Hive have a JDBC Driver?
 ODBC Driver
 Does Hive have an ODBC driver?

I see errors like: Server access Error: Connection timed out
url=http://archive.apache.org/dist/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz

Run the following commands:
cd ~/.ant/cache/hadoop/core/sources
wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz

How to change the warehouse.dir location for older tables?

To change the base location of the Hive tables, edit the hive.metastore.warehouse.dir parameter. This will not
affect older tables; their metadata needs to be changed in the metastore database (MySQL or Derby). The
location of Hive tables is stored in the SDS table, in the LOCATION column (see the sketch below).
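
A minimal sketch of such an update against a MySQL-backed metastore (the paths are hypothetical; back up
the metastore database before touching it):

UPDATE SDS
SET LOCATION = REPLACE(LOCATION, 'hdfs://namenode/old/warehouse', 'hdfs://namenode/new/warehouse')
WHERE LOCATION LIKE 'hdfs://namenode/old/warehouse%';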

When running a JOIN query, I see out-of-memory errors.

This is usually caused by the order of JOIN tables. Instead of "FROM tableA a JOIN tableB b ON ...", try
"FROM tableB b JOIN tableA a ON ...". NOTE that if you are using LEFT OUTER JOIN, you might want to
change to RIGHT OUTER JOIN. This trick usually solves the problem; the rule of thumb is: always put the
table with a lot of rows having the same value in the join key on the rightmost side of the JOIN.
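
An illustrative rewrite, using the placeholder names tableA/tableB from the answer above and assuming
tableA is the table with many duplicate join-key values:

-- before: SELECT b.key, b.value FROM tableA a JOIN tableB b ON (a.key = b.key)
-- after (the large tableA is moved to the rightmost side of the JOIN):
SELECT b.key, b.value
FROM tableB b JOIN tableA a ON (a.key = b.key);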

I am using MySQL as metastore and I see errors:
"com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure"

This is usually caused by MySQL servers closing connections after the connection has been idle for some time.
Running the following command on the MySQL server will solve the problem: "set global wait_status=120;"

When using MySQL as a metastore I see the error
"com.mysql.jdbc.exceptions.MySQLSyntaxErrorException: Specified key was too long; max key length is 767 bytes".

This is a known limitation of MySQL 5.0 and UTF8 databases. One option is to use another
character set, such as 'latin1', which is known to work.
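
A sketch of that workaround, assuming the metastore database is named 'metastore' (adjust the name to
your installation):

ALTER DATABASE metastore CHARACTER SET latin1;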

Does Hive support Unicode?

HiveQL

Are HiveQL identifiers (e.g. table names, column names, etc) case sensitive?

No. Hive is case insensitive.

Executing:

SELECT * FROM MyTable WHERE myColumn = 3

is strictly equivalent to

select * from mytable where mycolumn = 3

What are the maximum allowed lengths for HiveQL identifiers?

Importing Data into Hive


How do I import XML data into Hive?

How do I import CSV data into Hive?

How do I import JSON data into Hive?

How do I import Thrift data into Hive?

How do I import Avro data into Hive?

How do I import delimited text data into Hive?

How do I import fixed-width data into Hive?

How do I import ASCII logfiles (HTTP, etc) into Hive?

Exporting Data from Hive

Hive Data Model

What is the difference between a native table and an external


table?

What are dynamic partitions?

Can a Hive table contain data in more than one format?

Is it possible to set the data format on a per-partition basis?

JDBC Driver
Does Hive have a JDBC Driver?
Yes. Look for the hive-jdbc jar. The driver class is 'org.apache.hadoop.hive.jdbc.HiveDriver'.

It supports two modes: a local mode and a remote one.

In the remote mode it connects to the Hive server through its Thrift API. The JDBC URL to use should be of
the form: 'jdbc:hive://hostname:port/databasename'

In the local mode Hive is embedded. The JDBC URL to use should be 'jdbc:hive://'.

ODBC Driver

Does Hive have an ODBC driver?

Hive Tutorial

 Hive Tutorial
 Concepts
 What is Hive
 What Hive is NOT
 Data Units
 Type System
 Primitive Types
 Complex Types
 Built in operators and functions
 Built in operators
 Built in functions
 Language capabilities
 Usage and Examples
 Creating Tables
 Browsing Tables and Partitions
 Loading Data
 Simple Query
 Partition Based Query
 Joins
 Aggregations
 Multi Table/File Inserts
 Dynamic-partition Insert
 Inserting into local files
 Sampling
 Union all
 Array Operations
 Map(Associative Arrays) Operations
 Custom map/reduce scripts
 Co-Groups
 Altering Tables
 Dropping Tables and Partitions

Concepts

What is Hive
Hive is a data warehousing infrastructure based on Hadoop. Hadoop provides massive scale-out and
fault tolerance capabilities for data storage and processing (using the map-reduce programming
paradigm) on commodity hardware.

Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of
data. It provides a simple query language called Hive QL, which is based on SQL and which enables
users familiar with SQL to do ad-hoc querying, summarization and data analysis easily. At the same time,
Hive QL also allows traditional map/reduce programmers to be able to plug in their custom mappers and
reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the
language.

What Hive is NOT


Hadoop is a batch processing system and Hadoop jobs tend to have high latency and incur substantial
overheads in job submission and scheduling. As a result - latency for Hive queries is generally very high
(minutes) even when data sets involved are very small (say a few hundred megabytes). As a result it
cannot be compared with systems such as Oracle where analyses are conducted on a significantly
smaller amount of data but the analyses proceed much more iteratively with the response times between
iterations being less than a few minutes. Hive aims to provide acceptable (but not optimal) latency for
interactive data browsing, queries over small data sets or test queries.

Hive is not designed for online transaction processing and does not offer real-time queries and row level
updates. It is best used for batch jobs over large sets of immutable data (like web logs).

In the following sections we provide a tutorial on the capabilities of the system. We start by describing the
concepts of data types, tables and partitions (which are very similar to what you would find in a traditional
relational DBMS) and then illustrate the capabilities of the QL language with the help of some examples.

Data Units
In the order of granularity - Hive data is organized into:

 Databases: Namespaces that separate tables and other data units from naming conflicts.
 Tables: Homogeneous units of data which have the same schema. An example of a table could
be the page_views table, where each row could comprise the following columns (schema):
 timestamp - which is of INT type that corresponds to a unix timestamp of when the page
was viewed.
 userid - which is of BIGINT type that identifies the user who viewed the page.
 page_url - which is of STRING type that captures the location of the page.
 referer_url - which is of STRING type that captures the location of the page from where the
user arrived at the current page.
 IP - which is of STRING type that captures the IP address from where the page request
was made.
 Partitions: Each Table can have one or more partition keys which determine how the data is
stored. Partitions - apart from being storage units - also allow the user to efficiently identify the
rows that satisfy certain criteria. For example, a date_partition of type STRING and a
country_partition of type STRING. Each unique value of the partition keys defines a partition of
the Table. For example, all "US" data from "2009-12-23" is a partition of the page_views table.
Therefore, if you run analysis on only the "US" data for 2009-12-23, you can run that query only
on the relevant partition of the table, thereby speeding up the analysis significantly. Note, however,
that just because a partition is named 2009-12-23 does not mean that it contains all or only data
from that date; partitions are named after dates for convenience, but it is the user's job to
guarantee the relationship between partition name and data content. Partition columns are
virtual columns; they are not part of the data itself but are derived on load.
 Buckets (or Clusters): Data in each partition may in turn be divided into Buckets based on the
value of a hash function of some column of the Table. For example, the page_views table may be
bucketed by userid, which is one of the columns, other than the partition columns, of the
page_view table. These can be used to efficiently sample the data (see the sketch below).

Note that it is not necessary for tables to be partitioned or bucketed, but these abstractions allow the
system to prune large quantities of data during query processing, resulting in faster query execution.
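
For instance, a table bucketed like the page_view table defined later in this tutorial (32 buckets on userid)
can be sampled efficiently with TABLESAMPLE; this query is an illustrative sketch, not part of the original
tutorial text:

SELECT *
FROM page_view TABLESAMPLE(BUCKET 1 OUT OF 32 ON userid) s;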

Type System

Primitive Types
 Types are associated with the columns in the tables. The following Primitive types are supported:
 Integers
 TINYINT - 1 byte integer
 SMALLINT - 2 byte integer
 INT - 4 byte integer
 BIGINT - 8 byte integer
 Boolean type
 BOOLEAN - TRUE/FALSE
 Floating point numbers
 FLOAT - single precision
 DOUBLE - Double precision
 String type
 STRING - sequence of characters in a specified character set

The types are organized in the following hierarchy (where the parent is a super type of all the children
instances):

Type
  Primitive Type
    Number
      DOUBLE
        BIGINT
          INT
            TINYINT
        FLOAT
          INT
            TINYINT
    STRING
    BOOLEAN

This type hierarchy defines how the types are implicitly converted in the query language. Implicit
conversion is allowed for types from a child to an ancestor. So when a query expression expects type1 and
the data is of type2, type2 is implicitly converted to type1 if type1 is an ancestor of type2 in the type
hierarchy. Apart from these fundamental rules for implicit conversion based on the type system, Hive also
allows the following special case of conversion:

 <STRING> to <DOUBLE>

Explicit type conversion can be done using the cast operator as shown in the Built in functions section
below.
Complex Types
Complex Types can be built up from primitive types and other composite types using:

 Structs: the elements within the type can be accessed using the DOT (.) notation. For example,
for a column c of type STRUCT {a INT; b INT}, the a field is accessed by the expression c.a
 Maps (key-value tuples): The elements are accessed using ['element name'] notation. For
example, in a map M comprising of a mapping from 'group' -> gid, the gid value can be accessed
using M['group']
 Arrays (indexable lists): The elements in the array have to be of the same type. Elements can be
accessed using the [n] notation where n is an index (zero-based) into the array. For example, for
an array A having the elements ['a', 'b', 'c'], A[1] returns 'b'.

Using the primitive types and the constructs for creating complex types, types with arbitrary levels of
nesting can be created. For example, a type User may comprise of the following fields (a table definition
along these lines is sketched after this list):

 gender - which is a STRING.
 active - which is a BOOLEAN.
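
A minimal, illustrative table definition using these complex types (the table and column names here are
assumptions, not from the tutorial):

CREATE TABLE users (
  userid BIGINT,
  user STRUCT<gender: STRING, active: BOOLEAN>,
  properties MAP<STRING, STRING>,
  friends ARRAY<BIGINT>);

Fields would then be referenced in queries as user.gender, properties['group'], and friends[0].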

Built in operators and functions

Built in operators
 Relational Operators - The following operators compare the passed operands and generate a
TRUE or FALSE value depending on whether the comparison between the operands holds or
not.

 A = B (all primitive types): TRUE if expression A is equivalent to expression B, otherwise FALSE.
 A != B (all primitive types): TRUE if expression A is not equivalent to expression B, otherwise FALSE.
 A < B (all primitive types): TRUE if expression A is less than expression B, otherwise FALSE.
 A <= B (all primitive types): TRUE if expression A is less than or equal to expression B, otherwise FALSE.
 A > B (all primitive types): TRUE if expression A is greater than expression B, otherwise FALSE.
 A >= B (all primitive types): TRUE if expression A is greater than or equal to expression B, otherwise FALSE.
 A IS NULL (all types): TRUE if expression A evaluates to NULL, otherwise FALSE.
 A IS NOT NULL (all types): FALSE if expression A evaluates to NULL, otherwise TRUE.
 A LIKE B (strings): TRUE if string A matches the SQL simple regular expression B, otherwise FALSE. The
comparison is done character by character. The _ character in B matches any character in A (similar to .
in POSIX regular expressions), and the % character in B matches an arbitrary number of characters in A
(similar to .* in POSIX regular expressions). For example, 'foobar' LIKE 'foo' evaluates to FALSE,
whereas 'foobar' LIKE 'foo___' evaluates to TRUE and so does 'foobar' LIKE 'foo%'. To escape %
use \ (% matches one % character). If the data contains a semicolon and you want to search for it, it
needs to be escaped: columnValue LIKE 'a\;b'
 A RLIKE B (strings): TRUE if string A matches the Java regular expression B (see Java regular expressions
syntax), otherwise FALSE. For example, 'foobar' RLIKE 'foo' evaluates to FALSE whereas
'foobar' RLIKE '^f.*r$' evaluates to TRUE.
 A REGEXP B (strings): Same as RLIKE.

 Arithmetic Operators - The following operators support various common arithmetic operations
on the operands. All of them return number types.

 A + B (all number types): Gives the result of adding A and B. The type of the result is the same as the common
parent (in the type hierarchy) of the types of the operands, e.g. since every integer is a float, float is a
containing type of integer, so the + operator on a float and an int will result in a float.
 A - B (all number types): Gives the result of subtracting B from A. The type of the result is the same as the
common parent (in the type hierarchy) of the types of the operands.
 A * B (all number types): Gives the result of multiplying A and B. The type of the result is the same as the
common parent (in the type hierarchy) of the types of the operands. Note that if the multiplication causes
overflow, you will have to cast one of the operands to a type higher in the type hierarchy.
 A / B (all number types): Gives the result of dividing A by B. The type of the result is the same as the common
parent (in the type hierarchy) of the types of the operands. If the operands are integer types, then the
result is the quotient of the division.
 A % B (all number types): Gives the remainder resulting from dividing A by B. The type of the result is the same
as the common parent (in the type hierarchy) of the types of the operands.
 A & B (all number types): Gives the result of bitwise AND of A and B. The type of the result is the same as the
common parent (in the type hierarchy) of the types of the operands.
 A | B (all number types): Gives the result of bitwise OR of A and B. The type of the result is the same as the
common parent (in the type hierarchy) of the types of the operands.
 A ^ B (all number types): Gives the result of bitwise XOR of A and B. The type of the result is the same as the
common parent (in the type hierarchy) of the types of the operands.
 ~A (all number types): Gives the result of bitwise NOT of A. The type of the result is the same as the type of A.

 Logical Operators - The following operators provide support for creating logical expressions. All
of them return boolean TRUE or FALSE depending upon the boolean values of the operands.

 A AND B (boolean): TRUE if both A and B are TRUE, otherwise FALSE.
 A && B (boolean): Same as A AND B.
 A OR B (boolean): TRUE if either A or B or both are TRUE, otherwise FALSE.
 A || B (boolean): Same as A OR B.
 NOT A (boolean): TRUE if A is FALSE, otherwise FALSE.
 !A (boolean): Same as NOT A.

 Operators on Complex Types - The following operators provide mechanisms to access
elements in Complex Types:

 A[n] (A is an Array and n is an int): Returns the nth element in the array A. The first element has index 0, e.g. if A
is an array comprising of ['foo', 'bar'] then A[0] returns 'foo' and A[1] returns 'bar'.
 M[key] (M is a Map<K, V> and key has type K): Returns the value corresponding to the key in the map, e.g. if M
is a map comprising of {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'} then M['all'] returns 'foobar'.
 S.x (S is a struct): Returns the x field of S, e.g. for struct foobar {int foo, int bar}, foobar.foo returns the integer
stored in the foo field of the struct.

Built in functions
 The following built in functions are supported in hive:
(Function list in source code: FunctionRegistry.java)

 BIGINT round(double a): Returns the rounded BIGINT value of the double.
 BIGINT floor(double a): Returns the maximum BIGINT value that is equal to or less than the double.
 BIGINT ceil(double a): Returns the minimum BIGINT value that is equal to or greater than the double.
 double rand(), rand(int seed): Returns a random number (that changes from row to row). Specifying the seed
will make sure the generated random number sequence is deterministic.
 string concat(string A, string B, ...): Returns the string resulting from concatenating B after A. For example,
concat('foo', 'bar') results in 'foobar'. This function accepts an arbitrary number of arguments and returns
the concatenation of all of them.
 string substr(string A, int start): Returns the substring of A starting from the start position till the end of string A.
For example, substr('foobar', 4) results in 'bar'.
 string substr(string A, int start, int length): Returns the substring of A starting from the start position with the
given length, e.g. substr('foobar', 4, 2) results in 'ba'.
 string upper(string A): Returns the string resulting from converting all characters of A to upper case, e.g.
upper('fOoBaR') results in 'FOOBAR'.
 string ucase(string A): Same as upper.
 string lower(string A): Returns the string resulting from converting all characters of A to lower case, e.g.
lower('fOoBaR') results in 'foobar'.
 string lcase(string A): Same as lower.
 string trim(string A): Returns the string resulting from trimming spaces from both ends of A, e.g.
trim(' foobar ') results in 'foobar'.
 string ltrim(string A): Returns the string resulting from trimming spaces from the beginning (left hand side) of A.
For example, ltrim(' foobar ') results in 'foobar '.
 string rtrim(string A): Returns the string resulting from trimming spaces from the end (right hand side) of A.
For example, rtrim(' foobar ') results in ' foobar'.
 string regexp_replace(string A, string B, string C): Returns the string resulting from replacing all substrings in A
that match the Java regular expression B (see Java regular expressions syntax) with C. For example,
regexp_replace('foobar', 'oo|ar', '') returns 'fb'.
 int size(Map<K.V>): Returns the number of elements in the map type.
 int size(Array<T>): Returns the number of elements in the array type.
 value of <type> cast(<expr> as <type>): Converts the result of the expression expr to <type>, e.g. cast('1' as
BIGINT) will convert the string '1' to its integral representation. A null is returned if the conversion does
not succeed.
 string from_unixtime(int unixtime): Converts the number of seconds from the Unix epoch (1970-01-01 00:00:00
UTC) to a string representing the timestamp of that moment in the current system time zone, in the format
"1970-01-01 00:00:00".
 string to_date(string timestamp): Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") =
"1970-01-01".
 int year(string date): Returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970,
year("1970-01-01") = 1970.
 int month(string date): Returns the month part of a date or a timestamp string: month("1970-11-01 00:00:00") =
11, month("1970-11-01") = 11.
 int day(string date): Returns the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1,
day("1970-11-01") = 1.
 string get_json_object(string json_string, string path): Extracts a json object from a json string based on the
json path specified, and returns the json string of the extracted json object. It will return null if the input
json string is invalid.
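
A few of these combined in one illustrative query (using the u_data table from the MovieLens example earlier
in this document; this query is not part of the original tutorial):

SELECT userid,
       from_unixtime(cast(unixtime AS INT)),
       to_date(from_unixtime(cast(unixtime AS INT)))
FROM u_data;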

 The following built in aggregate functions are supported in Hive:

 BIGINT count(*), count(expr), count(DISTINCT expr[, expr_.]): count(*) returns the total number of retrieved rows,
including rows containing NULL values; count(expr) returns the number of rows for which the supplied
expression is non-NULL; count(DISTINCT expr[, expr]) returns the number of rows for which the supplied
expression(s) are unique and non-NULL.
 DOUBLE sum(col), sum(DISTINCT col): Returns the sum of the elements in the group or the sum of the distinct
values of the column in the group.
 DOUBLE avg(col), avg(DISTINCT col): Returns the average of the elements in the group or the average of the
distinct values of the column in the group.
 DOUBLE min(col): Returns the minimum value of the column in the group.
 DOUBLE max(col): Returns the maximum value of the column in the group.
Language capabilities
Hive query language provides basic SQL-like operations. These operations work on tables or
partitions. These operations are:

 Ability to filter rows from a table using a where clause.


 Ability to select certain columns from the table using a select clause.
 Ability to do equi-joins between two tables.
 Ability to evaluate aggregations on multiple "group by" columns for the data stored in a table.
 Ability to store the results of a query into another table.
 Ability to download the contents of a table to a local (e.g., nfs) directory.
 Ability to store the results of a query in a hadoop dfs directory.
 Ability to manage tables and partitions (create, drop and alter).
 Ability to plug in custom scripts in the language of choice for custom map/reduce jobs.

Usage and Examples


The following examples highlight some salient features of the system. A detailed set of query test cases
can be found at Hive Query Test Cases and the corresponding results can be found at Query Test Case
Results.

Creating Tables
An example statement that would create the page_view table mentioned above would be like:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

In this example the columns of the table are specified with the corresponding types. Comments can be
attached both at the column level as well as at the table level. Additionally the partitioned by clause
defines the partitioning columns which are different from the data columns and are actually not stored
with the data. When specified in this way, the data in the files is assumed to be delimited with ASCII
001(ctrl-A) as the field delimiter and newline as the row delimiter.

The field delimiter can be parametrized if the data is not in the above format as illustrated in the following
example:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS SEQUENCEFILE;

The row delimiter currently cannot be changed, since it is determined not by Hive but by Hadoop.

It is also a good idea to bucket the tables on certain columns so that efficient sampling queries can be
executed against the data set. If bucketing is absent, random sampling can still be done on the table but it
is not efficient as the query has to scan all the data. The following example illustrates the case of the
page_view table that is bucketed on the userid column:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE;

In the example above, the table is clustered by a hash function of userid into 32 buckets. Within each
bucket the data is sorted in increasing order of viewTime. Such an organization allows the user to do
efficient sampling on the clustered column - in this case userid. The sorting property allows internal
operators to take advantage of the better-known data structure while evaluating queries with greater
efficiency.
CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
friends ARRAY<BIGINT>, properties MAP<STRING, STRING>,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE;

In this example, the columns that comprise the table row are specified in a similar way as the definition
of types. Comments can be attached both at the column level as well as at the table level. Additionally the
partitioned by clause defines the partitioning columns which are different from the data columns and are
actually not stored with the data. The CLUSTERED BY clause specifies which column to use for
bucketing as well as how many buckets to create. The delimited row format specifies how the rows are
stored in the hive table. In the case of the delimited format, this specifies how the fields are terminated,
how the items within collections (arrays or maps) are terminated and how the map keys are terminated.
STORED AS SEQUENCEFILE indicates that this data is stored in a binary format (using hadoop
SequenceFiles) on hdfs. The values shown for the ROW FORMAT and STORED AS clauses in the
above example represent the system defaults.

Table names and column names are case insensitive.

Browsing Tables and Partitions


SHOW TABLES;

To list existing tables in the warehouse; there are many of these, likely more than you want to browse.
SHOW TABLES 'page.*';

To list tables with prefix 'page'. The pattern follows Java regular expression syntax (so the period is a
wildcard).
SHOW PARTITIONS page_view;

To list partitions of a table. If the table is not a partitioned table then an error is thrown.
DESCRIBE page_view;

To list columns and column types of table.


DESCRIBE EXTENDED page_view;

To list columns and all other properties of the table. This prints a lot of information, and not in a pretty
format. Usually used for debugging.
DESCRIBE EXTENDED page_view PARTITION (ds='2008-08-08');

To list columns and all other properties of a partition. This also prints a lot of information, which is usually
used for debugging.

Loading Data
There are multiple ways to load data into Hive tables. The user can create an external table that points to
a specified location within HDFS. In this particular usage, the user can copy a file into the specified
location using the HDFS put or copy commands and create a table pointing to this location with all the
relevant row format information. Once this is done, the user can transform the data and insert them into
any other Hive table. For example, if the file /tmp/pv_2008-06-08.txt contains comma separated page
views served on 2008-06-08, and this needs to be loaded into the page_view table in the appropriate
partition, the following sequence of commands can achieve this:
CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User',
country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view';

hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
WHERE pvs.country = 'US';

In the example above nulls are inserted for the array and map types in the destination tables but
potentially these can also come from the external table if the proper row formats are specified.

This method is useful if there is already legacy data in HDFS on which the user wants to put some
metadata so that the data can be queried and manipulated using Hive.

Additionally, the system also supports syntax that can load the data from a file in the local files system
directly into a Hive table where the input data format is the same as the table format. If /tmp/pv_2008-06-
08_us.txt already contains the data for US, then we do not need any additional filtering as shown in the
previous example. The load in this case can be done using the following syntax:
LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt' INTO TABLE page_view
PARTITION(date='2008-06-08', country='US')

The path argument can take a directory (in which case all the files in the directory are loaded), a single
file name, or a wildcard (in which case all the matching files are uploaded). If the argument is a directory -
it cannot contain subdirectories. Similarly - the wildcard must match file names only.

In the case that the input file /tmp/pv_2008-06-08_us.txt is very large, the user may decide to do a parallel
load of the data (using tools that are external to Hive). Once the file is in HDFS - the following syntax can
be used to load the data into a Hive table:
LOAD DATA INPATH '/user/data/pv_2008-06-08_us.txt' INTO TABLE page_view
PARTITION(date='2008-06-08', country='US')

It is assumed that the array and map fields in the input.txt files are null fields for these examples.

Simple Query
For all the active users, one can use the query of the following form:
INSERT OVERWRITE TABLE user_active
SELECT user.*
FROM user
WHERE user.active = 1;

Note that unlike SQL, we always insert the results into a table. We will illustrate later how the user can
inspect these results and even dump them to a local file. You can also run the following query on Hive
CLI:
SELECT user.*
FROM user
WHERE user.active = 1;

This will be internally rewritten to some temporary file and displayed to the Hive client side.
Partition Based Query
What partitions to use in a query is determined automatically by the system on the basis of where clause
conditions on partition columns. For example, in order to get all the page_views in the month of 03/2008
referred from domain xyz.com, one could write the following query:
INSERT OVERWRITE TABLE xyz_com_page_views
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31' AND
      page_views.referrer_url like '%xyz.com';

Note that page_views.date is used here because the table (above) was defined with PARTITIONED
BY(date DATETIME, country STRING) ; if you name your partition something different, don't expect .date
to do what you think!

Joins
In order to get a demographic breakdown (by gender) of page_view of 2008-03-03 one would need to join
the page_view table and the user table on the userid column. This can be accomplished with a join as
shown in the following query:
INSERT OVERWRITE TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM user u JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';

In order to do outer joins the user can qualify the join with LEFT OUTER, RIGHT OUTER or FULL
OUTER keywords in order to indicate the kind of outer join (left preserved, right preserved or both sides
preserved). For example, in order to do a full outer join in the query above, the corresponding syntax
would look like the following query:
INSERT OVERWRITE TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM user u FULL OUTER JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';

In order to check the existence of a key in another table, the user can use LEFT SEMI JOIN as illustrated by
the following example.
INSERT OVERWRITE TABLE pv_users
SELECT u.*
FROM user u LEFT SEMI JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';

In order to join more than one table, the user can use the following syntax:
INSERT OVERWRITE TABLE pv_friends
SELECT pv.*, u.gender, u.age, f.friends
FROM page_view pv JOIN user u ON (pv.userid = u.id) JOIN friend_list f ON
(u.id = f.uid)
WHERE pv.date = '2008-03-03';
Note that Hive only supports equi-joins. Also it is best to put the largest table on the rightmost side of the
join to get the best performance.

Aggregations
In order to count the number of distinct users by gender one could write the following query:
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count (DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;

Multiple aggregations can be done at the same time; however, no two aggregations can have different
DISTINCT columns. For example, while the following is possible:
INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*),
sum(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;

however, the following query is not allowed


INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT
pv_users.ip)
FROM pv_users
GROUP BY pv_users.gender;

Multi Table/File Inserts


The output of the aggregations or simple selects can be further sent into multiple tables or even to
hadoop dfs files (which can then be manipulated using hdfs utilities). e.g. if along with the gender
breakdown, one needed to find the breakdown of unique page views by age, one could accomplish that
with the following query:
FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
    SELECT pv_users.gender, count(DISTINCT pv_users.userid)
    GROUP BY pv_users.gender

INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
    SELECT pv_users.age, count(DISTINCT pv_users.userid)
    GROUP BY pv_users.age;

The first insert clause sends the results of the first group by to a Hive table while the second one sends
the results to a hadoop dfs file.

Dynamic-partition Insert
In the previous examples, the user has to know which partition to insert into and only one partition can be
inserted in one insert statement. If you want to load into multiple partitions, you have to use multi-insert
statement as illustrated below.
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url,
null, null, pvs.ip WHERE pvs.country = 'US'
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='CA')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url,
null, null, pvs.ip WHERE pvs.country = 'CA'
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='UK')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url,
null, null, pvs.ip WHERE pvs.country = 'UK';

To load data into all country partitions for a particular day, you have to add an insert statement for each country present in the input data. This is very inconvenient since you have to have prior knowledge of the list of countries in the input data and create the partitions beforehand. If the list changes on another day, you have to modify your insert DML as well as the partition creation DDLs. It is also inefficient since each insert statement may be turned into a MapReduce job.

Dynamic-partition insert (or multi-partition insert) is designed to solve this problem by dynamically determining which partitions should be created and populated while scanning the input table. This is a newly added feature that is only available from version 0.6.0. In a dynamic partition insert, the input column values are evaluated to determine which partition each row should be inserted into. If a partition has not been created, it is created automatically. Using this feature you need only one insert statement to create and populate all necessary partitions. In addition, since there is only one insert statement, there is only one corresponding MapReduce job. This significantly improves performance and reduces the Hadoop cluster workload compared to the multiple-insert case.

Below is an example of loading data to all country partitions using one insert statement:
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url,
null, null, pvs.ip, pvs.country

There are several syntactic differences from the multi-insert statement:

 country appears in the PARTITION specification, but with no value associated. In this case,
country is a dynamic partition column. On the other hand, dt has a value associated with it, which
means it is a static partition column. If a column is a dynamic partition column, its value will
come from the input column. Currently we only allow dynamic partition columns to be the last
column(s) in the partition clause because the partition column order indicates its hierarchical
order (meaning dt is the root partition, and country is the child partition). You cannot specify a
partition clause with (dt, country='US') because that would mean updating all date partitions
whose country sub-partition is 'US'.
 An additional pvs.country column is added in the select statement. This is the corresponding input
column for the dynamic partition column. Note that you do not need to add an input column for
the static partition column because its value is already known in the PARTITION clause. Note that
the dynamic partition values are selected by ordering, not name, and taken as the last columns
from the select clause.

Semantics of the dynamic partition insert statement:


 When non-empty partitions already exist for the dynamic partition columns (e.g.,
country='CA' exists under some dt root partition), they will be overwritten if the dynamic partition
insert sees the same value (say 'CA') in the input data. This is in line with the 'insert overwrite'
semantics. However, if the partition value 'CA' does not appear in the input data, the existing
partition will not be overwritten.
 Since a Hive partition corresponds to a directory in HDFS, the partition value has to conform to
the HDFS path format (URI in Java). Any character having a special meaning in a URI (e.g., '%', ':',
'/', '#') will be escaped with '%' followed by two hexadecimal digits of its ASCII value.
 If the input column is of a type different from STRING, its value will first be converted to STRING
to be used to construct the HDFS path.
 If the input column value is NULL or the empty string, the row will be put into a special partition,
whose name is controlled by the Hive parameter hive.exec.default.partition.name. The default
value is __HIVE_DEFAULT_PARTITION__. Basically this partition will contain all "bad" rows
whose values are not valid partition names. The caveat of this approach is that the bad value will
be lost and replaced by __HIVE_DEFAULT_PARTITION__ if you select them in Hive. JIRA HIVE-
1309 is a solution to let the user specify a "bad file" to retain the input partition column values as well.
 Dynamic partition insert could potentially be a resource hog in that it could generate a large number of
partitions in a short time. To protect against this, we define three parameters (a sketch of setting them
appears after this list):
 hive.exec.max.dynamic.partitions.pernode (default value 100) is the maximum number of
dynamic partitions that can be created by each mapper or reducer. If one mapper or
reducer creates more than the threshold, a fatal error will be raised from the
mapper/reducer (through a counter) and the whole job will be killed.
 hive.exec.max.dynamic.partitions (default value 1000) is the total number of
dynamic partitions that can be created by one DML statement. If each mapper/reducer does not exceed
the limit but the total number of dynamic partitions does, then an exception is raised at
the end of the job before the intermediate data are moved to the final destination.
 hive.exec.max.created.files (default value 100000) is the maximum total number
of files created by all mappers and reducers. This is implemented by updating a Hadoop
counter from each mapper/reducer whenever a new file is created. If the total number
exceeds hive.exec.max.created.files, a fatal error will be thrown and the job will be
killed.
 Another situation we want to protect against is a user accidentally specifying all partitions to be
dynamic partitions without specifying one static partition, while the original intention is to just
overwrite the sub-partitions of one root partition. We define another parameter
hive.exec.dynamic.partition.mode=strict to prevent the all-dynamic-partition case. In strict mode,
you have to specify at least one static partition. The default mode is strict. In addition, we have a
parameter hive.exec.dynamic.partition=true/false to control whether to allow dynamic partitions
at all. The default value is false.
 In Hive 0.6, dynamic partition insert does not work with hive.merge.mapfiles=true or
hive.merge.mapredfiles=true, so it internally turns off the merge parameters. Merging files in
dynamic partition inserts is supported in Hive 0.7 (see JIRA HIVE-1307 for details).

Troubleshooting and best practices:

 As stated above, when too many dynamic partitions are created by a particular mapper/reducer, a
fatal error could be raised and the job will be killed. The error message looks something like:
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> FROM page_view_stg pvs
      INSERT OVERWRITE TABLE page_view PARTITION(dt, country)
             SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url,
                    null, null, pvs.ip,
                    from_unixtime(pvs.viewTime, 'yyyy-MM-dd') ds, pvs.country;
...
2010-05-07 11:10:19,816 Stage-1 map = 0%, reduce = 0%
[Fatal Error] Operator FS_28 (id=41): fatal error. Killing the job.
Ended Job = job_201005052204_28178 with errors
...
The problem here is that one mapper will take a random set of rows, and it is very likely that the
number of distinct (dt, country) pairs will exceed the limit of
hive.exec.max.dynamic.partitions.pernode. One way around it is to group the rows by the
dynamic partition columns in the mapper and distribute them to the reducers where the dynamic
partitions will be created. In this case the number of distinct dynamic partitions will be significantly
reduced. The above example query could be rewritten to:
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt, country)
SELECT pvs.viewTime, pvs.userid, pvs.page_url,
pvs.referrer_url, null, null, pvs.ip,
from_unixtime(pvs.viewTime, 'yyyy-MM-dd')
ds, pvs.country
DISTRIBUTE BY ds, country;
This query will generate a MapReduce job rather than a map-only job. The SELECT clause will be
converted to a plan for the mappers, and the output will be distributed to the reducers based on the
value of the (ds, country) pairs. The INSERT clause will be converted to a plan in the reducers, which
write to the dynamic partitions.

Inserting into local files


In certain situations you would want to write the output into a local file so that you could load it
into an Excel spreadsheet. This can be accomplished with the following command:
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pv_gender_sum'
SELECT pv_gender_sum.*
FROM pv_gender_sum;

Sampling
The sampling clause allows users to write queries for samples of the data instead of the whole table.
Currently the sampling is done on the columns that are specified in the CLUSTERED BY clause of the
CREATE TABLE statement. In the following example we choose the 3rd bucket out of the 32 buckets of the
pv_gender_sum table:
INSERT OVERWRITE TABLE pv_gender_sum_sample
SELECT pv_gender_sum.*
FROM pv_gender_sum TABLESAMPLE(BUCKET 3 OUT OF 32);

In general the TABLESAMPLE syntax looks like:


TABLESAMPLE(BUCKET x OUT OF y)
y has to be a multiple or divisor of the number of buckets in that table as specified at table creation
time. The buckets chosen are those for which bucket_number modulo y is equal to x. So in the above
example the following tablesample clause
TABLESAMPLE(BUCKET 3 OUT OF 16)

would pick out the 3rd and 19th buckets. The buckets are numbered starting from 0.

On the other hand the tablesample clause


TABLESAMPLE(BUCKET 3 OUT OF 64 ON userid)

would pick out half of the 3rd bucket.
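Used inside a full query, the clause would appear as in the following sketch, which assumes that page_view was bucketed on userid at table creation time and that a pv_userid_sample target table exists (both assumptions, not defined here):
INSERT OVERWRITE TABLE pv_userid_sample
SELECT *
FROM page_view TABLESAMPLE(BUCKET 3 OUT OF 64 ON userid);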

Union all
The language also supports UNION ALL. For example, suppose there are two different tables that track which
user has published a video and which user has published a comment; the following query joins the results
of a UNION ALL with the user table to create a single annotated stream for all the video publishing and
comment publishing events:
INSERT OVERWRITE TABLE actions_users
SELECT u.id, actions.date
FROM (
SELECT av.uid AS uid
FROM action_video av
WHERE av.date = '2008-06-03'

UNION ALL

SELECT ac.uid AS uid


FROM action_comment ac
WHERE ac.date = '2008-06-03'
) actions JOIN users u ON(u.id = actions.uid);

Array Operations
Array columns in tables can currently only be created programmatically. We will be extending this soon to
be available as part of the create table statement. For the purpose of the current example, assume that
pv.friends is of the type array<INT>, i.e. it is an array of integers. The user can get a specific element in the
array by its index as shown in the following command:
SELECT pv.friends[2]
FROM page_views pv;

The select expression gets the third item in the pv.friends array.

The user can also get the length of the array using the size function as shown below:
SELECT pv.userid, size(pv.friends)
FROM page_view pv;

Map(Associative Arrays) Operations


Maps provide collections similar to associative arrays. Such structures can currently only be created
programmatically. We will be extending this soon. For the purpose of the current example,
assume that pv.properties is of the type map<String, String>, i.e. it is an associative array from strings to
strings. Accordingly, the following query:
INSERT OVERWRITE TABLE page_views_map
SELECT pv.userid, pv.properties['page type']
FROM page_views pv;

can be used to select the 'page type' property from the page_views table.

Similar to arrays, the size function can also be used to get the number of elements in a map as shown in
the following query:
SELECT size(pv.properties)
FROM page_view pv;

Custom map/reduce scripts


Users can also plug in their own custom mappers and reducers in the data stream by using features
natively supported in the Hive language. For example, to run a custom mapper script (map_script) and a
custom reducer script (reduce_script), the user can issue the following command, which uses the
TRANSFORM clause to embed the mapper and the reducer scripts.

Note that columns will be transformed to string and delimited by TAB before feeding to the user script,
and the standard output of the user script will be treated as TAB-separated string columns. User scripts
can output debug information to standard error which will be shown on the task detail page on hadoop.
FROM (
FROM pv_users
MAP pv_users.userid, pv_users.date
USING 'map_script'
AS dt, uid
CLUSTER BY dt) map_output

INSERT OVERWRITE TABLE pv_users_reduced


REDUCE map_output.dt, map_output.uid
USING 'reduce_script'
AS date, count;

Sample map script (weekday_mapper.py)


import sys
import datetime

# Each input line is a TAB-separated pair of columns fed by Hive
# (here a userid and a unix timestamp).
for line in sys.stdin:
    line = line.strip()
    userid, unixtime = line.split('\t')
    # Convert the unix timestamp to an ISO weekday (1 = Monday ... 7 = Sunday).
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    # Emit TAB-separated output so Hive can split it back into columns.
    print '\t'.join([userid, str(weekday)])

Of course, both MAP and REDUCE are "syntactic sugar" for the more general select transform. The inner
query could also have been written as such:
SELECT TRANSFORM(pv_users.userid, pv_users.date) USING 'map_script' AS
dt, uid CLUSTER BY dt FROM pv_users;
Schema-less map/reduce: If there is no "AS" clause after "USING map_script", Hive assumes the output
of the script contains two parts: a key, which is before the first tab, and a value, which is the rest after the first
tab. Note that this is different from specifying "AS key, value", because in that case the value will only contain
the portion between the first tab and the second tab if there are multiple tabs.

In this way, we allow users to migrate old map/reduce scripts without knowing the schema of the map
output. The user still needs to know the reduce output schema because that has to match what is in the table
that we are inserting into.
FROM (
FROM pv_users
MAP pv_users.userid, pv_users.date
USING 'map_script'
CLUSTER BY key) map_output

INSERT OVERWRITE TABLE pv_users_reduced

REDUCE map_output.dt, map_output.uid


USING 'reduce_script'
AS date, count;

Distribute By and Sort By: Instead of specifying "cluster by", the user can specify "distribute by" and "sort
by", so the partition columns and sort columns can be different. The usual case is that the partition
columns are a prefix of sort columns, but that is not required.
FROM (
FROM pv_users
MAP pv_users.userid, pv_users.date
USING 'map_script'
AS c1, c2, c3
DISTRIBUTE BY c2
SORT BY c2, c1) map_output

INSERT OVERWRITE TABLE pv_users_reduced

REDUCE map_output.c1, map_output.c2, map_output.c3


USING 'reduce_script'
AS date, count;

Co-Groups
Amongst the user community using map/reduce, cogroup is a fairly common operation wherein the data
from multiple tables are sent to a custom reducer such that the rows are grouped by the values of certain
columns of the tables. With the UNION ALL operator and the CLUSTER BY specification, this can be
achieved in the Hive query language in the following way. Suppose we want to cogroup the rows from
the action_video and action_comment tables on the uid column and send them to the 'reduce_script'
custom reducer; the following syntax can be used:
FROM (
FROM (
FROM action_video av
SELECT av.uid AS uid, av.id AS id, av.date AS date

UNION ALL
FROM action_comment ac
SELECT ac.uid AS uid, ac.id AS id, ac.date AS date
) union_actions
SELECT union_actions.uid, union_actions.id, union_actions.date
CLUSTER BY union_actions.uid) map

INSERT OVERWRITE TABLE actions_reduced


SELECT TRANSFORM(map.uid, map.id, map.date) USING 'reduce_script' AS
(uid, id, reduced_val);

Altering Tables
To rename an existing table to a new name (if a table with the new name already exists, an error is
returned):
ALTER TABLE old_table_name RENAME TO new_table_name;

To rename the columns of an existing table, be sure to use the same column types and to include an
entry for each preexisting column:
ALTER TABLE old_table_name REPLACE COLUMNS (col1 TYPE, ...);

To add columns to an existing table:


ALTER TABLE tab1 ADD COLUMNS (c1 INT COMMENT 'a new int column', c2
STRING DEFAULT 'def val');

Note that a change in the schema (such as adding columns) preserves the schema for the old
partitions of the table, in case it is a partitioned table. All the queries that access these columns and run
over the old partitions implicitly return a null value or the specified default values for these columns.

In later versions, the behavior of assuming certain values, as opposed to throwing an error when the
column is not found in a particular partition, can be made configurable.
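As a small illustration of this behavior (session_length is a hypothetical new column; page_view and its dt partition come from the earlier examples), a query over a pre-existing partition simply returns NULL for the newly added column:
ALTER TABLE page_view ADD COLUMNS (session_length INT COMMENT 'hypothetical new column');

SELECT dt, session_length
FROM page_view
WHERE dt = '2008-06-08';   -- session_length is NULL for rows in partitions written before the ALTER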

Dropping Tables and Partitions


Dropping tables is fairly trivial. A drop on the table would implicitly drop any indexes (this is a future
feature) that would have been built on the table. The associated command is
DROP TABLE pv_users;

To drop a partition, alter the table to drop the partition.


ALTER TABLE pv_users DROP PARTITION (ds='2008-08-08')
Note that any data for this table or partition will be dropped and may not be recoverable.

This is the Hive Language Manual.

 Hive CLI
 Variable Substitution
 Data Types
 Data Definition Statements
 Data Manipulation Statements
 Select
 Group By
 Sort/Distribute/Cluster/Order By
 Transform and Map-Reduce Scripts
 Operators and User-Defined Functions
 XPath-specific Functions
 Joins
 Lateral View
 Union
 Sub Queries
 Sampling
 Explain
 Virtual Columns
 Locks
 Import/Export
 Configuration Properties
 Authorization
 Statistics
 Archiving

Child Pages (17)


Page: LanguageManual Cli
Page: LanguageManual DDL
Page: LanguageManual DML
Page: LanguageManual Select
Page: LanguageManual Joins
Page: LanguageManual LateralView
Page: LanguageManual Union
Page: LanguageManual SubQueries
Page: LanguageManual Sampling
Page: LanguageManual Explain
Page: LanguageManual VirtualColumns
Page: Configuration Properties
Page: LanguageManual ImportExport
Page: LanguageManual Authorization
Page: LanguageManual Types
Page: Literals
Page: LanguageManual VariableSubstitution

Hive Operators and Functions


 Hive Plug-in Interfaces - User-Defined Functions and SerDes
 Reflect UDF
 Guide to Hive Operators and Functions
 Functions for Statistics and Data Mining
Hive Web Interface

What is the Hive Web Interface


The Hive web interface is an alternative to using the Hive command line interface. Using the web
interface is a great way to get started with Hive.

Features

Schema Browsing
An alternative to running 'show tables' or 'show extended tables' from the CLI is to use the web-based
schema browser. The Hive metadata is presented in a hierarchical manner, allowing you to start at the
database level and click to get information about tables, including the SerDe, column names, and column
types.

Detached query execution


A power user issuing multiple Hive queries simultaneously would have multiple CLI windows open. The
Hive web interface manages the session on the web server, not from inside the CLI window. This allows a
user to start multiple queries and return to the web interface later to check the status.

No local installation
Any user with a web browser can work with Hive. This has the usual web interface benefits. In particular, a
user wishing to interact with Hadoop or Hive directly requires access to many ports, whereas a remote or VPN
user would only require access to the Hive web interface, which runs by default on 0.0.0.0 tcp/9999.

Configuration
The Hive Web Interface made its first appearance in the 0.2 branch. If you have 0.2 or the SVN trunk, you
already have it.

You should not need to edit the defaults for the Hive web interface. HWI uses:
<property>
<name>hive.hwi.listen.host</name>
<value>0.0.0.0</value>
<description>This is the host address the Hive Web Interface will listen
on</description>
</property>

<property>
<name>hive.hwi.listen.port</name>
<value>9999</value>
<description>This is the port the Hive Web Interface will listen
on</description>
</property>

<property>
<name>hive.hwi.war.file</name>
<value>${HIVE_HOME}/lib/hive_hwi.war</value>
<description>This is the WAR file with the jsp content for Hive Web
Interface</description>
</property>

You probably want to set up HiveDerbyServerMode to allow multiple sessions at the same time.

Start up
When Hive is initialized with no arguments, the CLI is invoked. Hive has an extension architecture used to
start other Hive daemons.
Jetty requires Apache Ant to start HWI. You should define ANT_LIB as an environment variable or add
that to the hive invocation.
export ANT_LIB=/opt/ant/lib
bin/hive --service hwi

Java has no direct way of daemonizing. In a production environment you should create a wrapper script.
nohup bin/hive --service hwi > /dev/null 2> /dev/null &

If you want help on the service invocation or a list of parameters, you can run:
bin/hive --service hwi --help

Authentication
Hadoop currently uses environment properties to determine the user name and group vector. Thus Hive
and the Hive Web Interface cannot enforce more stringent security than Hadoop can. When you first connect
to the Hive Web Interface, you are prompted for a user name and groups. This feature was added to
support installations using different schedulers.

If you want to tighten up security, you are going to need to patch the source of the Hive Session Manager, or you
may be able to tweak the JSP to accomplish this.

Accessing
In order to access the Hive Web Interface, go to <Hive Server Address>:9999/hwi on your web browser.

Tips and tricks


Result file
The result file is local to the web server. A query that produces massive output should set the result file to
/dev/null.

Debug Mode
The debug mode is used when the user is interested in having the result file contain not only the result of
the Hive query but also the other messages.

Set Processor
In the CLI, a command like 'SET x=5' is not processed by the Query Processor; it is processed by
the Set Processor. Use the form 'x=5', not 'set x=5'.

Walk through

Authorize

Schema Browser

Diagnostics

Running a query

 Command Line
 JDBC
 JDBC Client Sample Code
 Running the JDBC Sample Code
 JDBC Client Setup for a Secure Cluster
 Python
 PHP
 Thrift Java Client
 ODBC
 Thrift C++ Client

This page describes the different clients supported by Hive. The command line client currently only
supports an embedded server. The JDBC and thrift-java clients support both embedded and standalone
servers. Clients in other languages only support standalone servers. For details about the standalone
server see Hive Server.

Command Line
Operates in embedded mode only, i.e., it needs to have access to the hive libraries. For more details
see Getting Started.

JDBC
For embedded mode, uri is just "jdbc:hive://". For standalone server, uri is "jdbc:hive://host:port/dbname"
where host and port are determined by where the hive server is run. For example,
"jdbc:hive://localhost:10000/default". Currently, the only dbname supported is "default".

JDBC Client Sample Code


import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveJdbcClient {


private static String driverName =
"org.apache.hadoop.hive.jdbc.HiveDriver";

/**
* @param args
* @throws SQLException
*/
public static void main(String[] args) throws SQLException {
try {
Class.forName(driverName);
} catch (ClassNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
System.exit(1);
}
Connection con =
DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
Statement stmt = con.createStatement();
String tableName = "testHiveDriverTable";
stmt.executeQuery("drop table " + tableName);
ResultSet res = stmt.executeQuery("create table " + tableName + " (key
int, value string)");
// show tables
String sql = "show tables '" + tableName + "'";
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
if (res.next()) {
System.out.println(res.getString(1));
}
// describe table
sql = "describe " + tableName;
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
while (res.next()) {
System.out.println(res.getString(1) + "\t" + res.getString(2));
}

// load data into table


// NOTE: filepath has to be local to the hive server
// NOTE: /tmp/a.txt is a ctrl-A separated file with two fields per line
String filepath = "/tmp/a.txt";
sql = "load data local inpath '" + filepath + "' into table " +
tableName;
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);

// select * query
sql = "select * from " + tableName;
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
while (res.next()) {
System.out.println(String.valueOf(res.getInt(1)) + "\t" +
res.getString(2));
}

// regular hive query


sql = "select count(1) from " + tableName;
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
while (res.next()) {
System.out.println(res.getString(1));
}
}
}

Running the JDBC Sample Code


# Then on the command-line
$ javac HiveJdbcClient.java

# To run the program in standalone mode, we need the following jars in the
classpath
# from hive/build/dist/lib
# hive_exec.jar
# hive_jdbc.jar
# hive_metastore.jar
# hive_service.jar
# libfb303.jar
# log4j-1.2.15.jar
#
# from hadoop/build
# hadoop-*-core.jar
#
# To run the program in embedded mode, we need the following additional jars
in the classpath
# from hive/build/dist/lib
# antlr-runtime-3.0.1.jar
# derby.jar
# jdo2-api-2.1.jar
# jpox-core-1.2.2.jar
# jpox-rdbms-1.2.2.jar
#
# as well as hive/build/dist/conf

$ java -cp $CLASSPATH HiveJdbcClient

# Alternatively, you can run the following bash script, which will seed the
data file
# and build your classpath before invoking the client.

#!/bin/bash
HADOOP_HOME=/your/path/to/hadoop
HIVE_HOME=/your/path/to/hive

echo -e '1\x01foo' > /tmp/a.txt


echo -e '2\x01bar' >> /tmp/a.txt

HADOOP_CORE=`ls $HADOOP_HOME/hadoop-*-core.jar`
CLASSPATH=.:$HADOOP_CORE:$HIVE_HOME/conf

for i in ${HIVE_HOME}/lib/*.jar ; do
CLASSPATH=$CLASSPATH:$i
done

java -cp $CLASSPATH HiveJdbcClient

JDBC Client Setup for a Secure Cluster


To configure Hive on a secure cluster, add the directory containing hive-site.xml to the CLASSPATH of
the JDBC client.

Python
Operates only on a standalone server. Set (and export) PYTHONPATH to build/dist/lib/py.

The python modules imported in the code below are generated by building hive.
Please note that the generated python module names have changed in hive trunk.
#!/usr/bin/env python

import sys

from hive import ThriftHive
from hive.ttypes import HiveServerException
from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol

try:
    transport = TSocket.TSocket('localhost', 10000)
    transport = TTransport.TBufferedTransport(transport)
    protocol = TBinaryProtocol.TBinaryProtocol(transport)

    client = ThriftHive.Client(protocol)
    transport.open()

    client.execute("CREATE TABLE r(a STRING, b INT, c DOUBLE)")
    client.execute("LOAD DATA LOCAL INPATH '/path' INTO TABLE r")
    client.execute("SELECT * FROM r")
    while (1):
        row = client.fetchOne()
        if (row == None):
            break
        print row
    client.execute("SELECT * FROM r")
    print client.fetchAll()

    transport.close()

except Thrift.TException, tx:
    print '%s' % (tx.message)

PHP
Operates only on a standalone server.
<?php
// set THRIFT_ROOT to php directory of the hive distribution
$GLOBALS['THRIFT_ROOT'] = '/lib/php/';
// load the required files for connecting to Hive
require_once $GLOBALS['THRIFT_ROOT'] .
'packages/hive_service/ThriftHive.php';
require_once $GLOBALS['THRIFT_ROOT'] . 'transport/TSocket.php';
require_once $GLOBALS['THRIFT_ROOT'] . 'protocol/TBinaryProtocol.php';
// Set up the transport/protocol/client
$transport = new TSocket('localhost', 10000);
$protocol = new TBinaryProtocol($transport);
$client = new ThriftHiveClient($protocol);
$transport->open();
// run queries, metadata calls etc
$client->execute('SELECT * from src');
var_dump($client->fetchAll());
$transport->close();

Thrift Java Client


Operates both in embedded mode and on standalone server.

ODBC
Operates only on a standalone server. See Hive ODBC.

Thrift C++ Client


Operates only on a standalone server. In the works.

 Beeline - New Command Line shell


 JDBC
 JDBC Client Sample Code
 Running the JDBC Sample Code
 JDBC Client Setup for a Secure Cluster

This page describes the different clients supported by HiveServer2.

Beeline - New Command Line shell


HiveServer2 supports a new command shell, Beeline, that works with HiveServer2. It is a JDBC client
based on the SQLLine CLI (http://sqlline.sourceforge.net/). There is detailed documentation of
SQLLine which is applicable to Beeline as well. The Beeline shell works in both embedded and
remote mode. In embedded mode, it runs an embedded Hive (similar to the Hive CLI), whereas remote
mode is for connecting to a separate HiveServer2 process over Thrift.

Example -
% bin/beeline
Hive version 0.11.0-SNAPSHOT by Apache
beeline> !connect jdbc:hive2://localhost:10000 scott tiger
org.apache.hive.jdbc.HiveDriver
!connect jdbc:hive2://localhost:10000 scott tiger
org.apache.hive.jdbc.HiveDriver
Connecting to jdbc:hive2://localhost:10000
Connected to: Hive (version 0.10.0)
Driver: Hive (version 0.10.0-SNAPSHOT)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000> show tables;
show tables;
+-------------------+
| tab_name |
+-------------------+
| primitives |
| src |
| src1 |
| src_json |
| src_sequencefile |
| src_thrift |
| srcbucket |
| srcbucket2 |
| srcpart |
+-------------------+
9 rows selected (1.079 seconds)

JDBC
HiveServer2 has a new JDBC driver. It supports both embedded and remote access to HiveServer2.
The JDBC connection URL format has the prefix jdbc:hive2:// and the driver class is
org.apache.hive.jdbc.HiveDriver. Note that this is different from the old HiveServer. For a remote server, the
URL format is jdbc:hive2://<host>:<port>/<db> (the default port for HiveServer2 is 10000). For an embedded
server, the URL format is jdbc:hive2:// (no host or port).

JDBC Client Sample Code

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveJdbcClient {


private static String driverName = "org.apache.hive.jdbc.HiveDriver";

/**
* @param args
* @throws SQLException
*/
public static void main(String[] args) throws SQLException {
try {
Class.forName(driverName);
} catch (ClassNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
System.exit(1);
}
//replace "hive" here with the name of the user the queries should run as
Connection con =
DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive",
"");
Statement stmt = con.createStatement();
String tableName = "testHiveDriverTable";
stmt.execute("drop table if exists " + tableName);
stmt.execute("create table " + tableName + " (key int, value string)");
// show tables
String sql = "show tables '" + tableName + "'";
System.out.println("Running: " + sql);
ResultSet res = stmt.executeQuery(sql);
if (res.next()) {
System.out.println(res.getString(1));
}
// describe table
sql = "describe " + tableName;
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
while (res.next()) {
System.out.println(res.getString(1) + "\t" + res.getString(2));
}

// load data into table


// NOTE: filepath has to be local to the hive server
// NOTE: /tmp/a.txt is a ctrl-A separated file with two fields per line
String filepath = "/tmp/a.txt";
sql = "load data local inpath '" + filepath + "' into table " +
tableName;
System.out.println("Running: " + sql);
stmt.execute(sql);

// select * query
sql = "select * from " + tableName;
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
while (res.next()) {
System.out.println(String.valueOf(res.getInt(1)) + "\t" +
res.getString(2));
}

// regular hive query


sql = "select count(1) from " + tableName;
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
while (res.next()) {
System.out.println(res.getString(1));
}
}
}

Running the JDBC Sample Code


# Then on the command-line
$ javac HiveJdbcClient.java
# To run the program in standalone mode, we need the following jars in the
classpath
# from hive/build/dist/lib
# hive-jdbc*.jar
# hive-service*.jar
# libfb303-0.9.0.jar
# libthrift-0.9.0.jar
# log4j-1.2.16.jar
# slf4j-api-1.6.1.jar
# slf4j-log4j12-1.6.1.jar
# commons-logging-1.0.4.jar
#
#
# Following additional jars are needed for the kerberos secure mode -
# hive-exec*.jar
# commons-configuration-1.6.jar
# and from hadoop - hadoop-*core.jar
# To run the program in embedded mode, we need the following additional jars
in the classpath
# from hive/build/dist/lib
# hive-exec*.jar
# hive-metastore*.jar
# antlr-runtime-3.0.1.jar
# derby.jar
# jdo2-api-2.1.jar
# jpox-core-1.2.2.jar
# jpox-rdbms-1.2.2.jar
#
# from hadoop/build
# hadoop-*-core.jar
# as well as hive/build/dist/conf, any HIVE_AUX_JARS_PATH set, and hadoop
jars necessary to run MR jobs (eg lzo codec)

$ java -cp $CLASSPATH HiveJdbcClient

# Alternatively, you can run the following bash script, which will seed the
data file
# and build your classpath before invoking the client. The script adds all
the
# additional jars needed for using HiveServer2 in embedded mode as well.

#!/bin/bash
HADOOP_HOME=/your/path/to/hadoop
HIVE_HOME=/your/path/to/hive

echo -e '1\x01foo' > /tmp/a.txt


echo -e '2\x01bar' >> /tmp/a.txt

HADOOP_CORE=`ls $HADOOP_HOME/hadoop-*-core.jar`
CLASSPATH=.:$HIVE_HOME/conf:`hadoop classpath`

for i in ${HIVE_HOME}/lib/*.jar ; do
CLASSPATH=$CLASSPATH:$i
done

java -cp $CLASSPATH HiveJdbcClient

JDBC Client Setup for a Secure Cluster
When connecting to HiveServer2 with kerberos authentication, the URL format is
jdbc:hive2://<host>:<port>/<db>;principal=<Server_Principal_of_HiveServer2>. The client needs to have
a valid Kerberos ticket in the ticket cache before connecting.

In the case of LDAP or custom pass-through authentication, the client needs to pass a valid user name
and password to the JDBC connection API.

This page documents changes that are visible to users.

 Hive Trunk (0.8.0-dev)


 Hive 0.7.1
 Hive 0.7.0
 Hive 0.6.0
 Hive 0.5.0

Hive Trunk (0.8.0-dev)

Hive 0.7.1

Hive 0.7.0
 HIVE-1790: Add support for HAVING clause.

Hive 0.6.0

Hive 0.5.0

Earliest version AvroSerde is available


The AvroSerde is available in Hive 0.9.1 and greater.

Overview - Working with Avro from Hive


The AvroSerde allows users to read or write Avro data as Hive tables. Key features of the AvroSerde:

 Infers the schema of the Hive table from the Avro schema.
 Reads all Avro files within a table against a specified schema, taking advantage of Avro's
backwards compatibility abilities
 Supports arbitrarily nested schemas.
 Translates all Avro data types into equivalent Hive types. Most types map exactly, but some Avro
types don't exist in Hive and are automatically converted by the AvroSerde.
 Understands compressed Avro files.
 Transparently converts the Avro idiom of handling nullable types as Union[T, null] into just T and
returns null when appropriate.
 Writes any Hive table to Avro files.
 Has worked reliably against our most convoluted Avro schemas in our ETL process.

Requirements
The AvroSerde has been built and tested against Hive 0.9.1 and Avro 1.5.

Avro to Hive type conversion


While most Avro types convert directly to equivalent Hive types, there are some which do not exist in Hive
and are converted to reasonable equivalents. Also, the AvroSerde special cases unions of null and
another type, as described below:

Avro type -> Hive type (note)

null -> void
boolean -> boolean
int -> int
long -> bigint
float -> float
double -> double
bytes -> Array[smallint] (Hive converts these to signed bytes)
string -> string
record -> struct
map -> map
list -> array
union -> union (Unions of [T, null] transparently convert to a nullable T; other unions translate directly to Hive's unions of those types. However, unions were introduced in Hive 0.7 and are not currently able to be used in where/group-by statements. They are essentially look-at-only. Because the AvroSerde transparently converts [T, null] to a nullable T, this limitation only applies to unions of multiple types or unions not of a single type and null.)
enum -> string (Hive has no concept of enums)
fixed -> Array[smallint] (Hive converts the bytes to signed int)

Creating Avro-backed Hive tables


To create an Avro-backed table, specify the serde as org.apache.hadoop.hive.serde2.avro.AvroSerDe,
specify the inputformat as org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat, and the
outputformat as org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat. Also provide a location
from which the AvroSerde will pull the most current schema for the table. For example:
CREATE TABLE kst
PARTITIONED BY (ds string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
'avro.schema.url'='http://schema_provider/kst.avsc');

In this example we're pulling the source-of-truth reader schema from a webserver. Other options for
providing the schema are described below.
Add the Avro files to the database (or create an external table) using standard Hive
operations (http://wiki.apache.org/hadoop/Hive/LanguageManual/DML).
This table might result in a description as below:
hive> describe kst;
OK
string1 string from deserializer
string2 string from deserializer
int1 int from deserializer
boolean1 boolean from deserializer
long1 bigint from deserializer
float1 float from deserializer
double1 double from deserializer
inner_record1
struct<int_in_inner_record1:int,string_in_inner_record1:string> from
deserializer
enum1 string from deserializer
array1 array<string> from deserializer
map1 map<string,string> from deserializer
union1 uniontype<float,boolean,string> from deserializer
fixed1 array<tinyint> from deserializer
null1 void from deserializer
unionnullint int from deserializer
bytes1 array<tinyint> from deserializer

At this point, the Avro-backed table can be worked with in Hive like any other table.
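For instance (a minimal sketch: the HDFS file path is hypothetical and the ds value is arbitrary), pre-existing Avro files can be attached to a partition and then queried with ordinary HiveQL:
LOAD DATA INPATH '/user/data/avro/kst_sample.avro' INTO TABLE kst PARTITION (ds='2012-01-01');

SELECT string1, int1
FROM kst
WHERE ds = '2012-01-01';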

Writing tables to Avro files


The AvroSerde can serialize any Hive table to Avro files. This makes it effectively an any-Hive-type to
Avro converter. In order to write a table to an Avro file, you must first create an appropriate Avro schema.
Create as select type statements are not currently supported. Types translate as detailed in the table
above. For types that do not translate directly, there are a few items to keep in mind:

 Types that may be null must be defined as a union of that type and Null within Avro. A null
in a field that is not so defined will result in an exception during the save. No changes need be
made to the Hive schema to support this, as all fields in Hive can be null.
 Avro Bytes type should be defined in Hive as lists of tiny ints. The AvroSerde will convert these to
Bytes during the saving process.
 Avro Fixed type should be defined in Hive as lists of tiny ints. The AvroSerde will convert these to
Fixed during the saving process.
 Avro Enum type should be defined in Hive as strings, since Hive doesn't have a concept of
enums. Ensure that only valid enum values are present in the table - trying to save a non-defined
enum will result in an exception.

Example

Consider the following Hive table, which coincidentally covers all of the Hive data types, making it a
good example:
CREATE TABLE test_serializer(string1 STRING,
int1 INT,
tinyint1 TINYINT,
smallint1 SMALLINT,
bigint1 BIGINT,
boolean1 BOOLEAN,
float1 FLOAT,
double1 DOUBLE,
list1 ARRAY<STRING>,
map1 MAP<STRING,INT>,
struct1
STRUCT<sint:INT,sboolean:BOOLEAN,sstring:STRING>,
union1 uniontype<FLOAT, BOOLEAN, STRING>,
enum1 STRING,
nullableint INT,
bytes1 ARRAY<TINYINT>,
fixed1 ARRAY<TINYINT>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY
':' MAP KEYS TERMINATED BY '#' LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

To save this table as an Avro file, create an equivalent Avro schema (the namespace and actual name of
the record are not important):
{
"namespace": "com.linkedin.haivvreo",
"name": "test_serializer",
"type": "record",
"fields": [
{ "name":"string1", "type":"string" },
{ "name":"int1", "type":"int" },
{ "name":"tinyint1", "type":"int" },
{ "name":"smallint1", "type":"int" },
{ "name":"bigint1", "type":"long" },
{ "name":"boolean1", "type":"boolean" },
{ "name":"float1", "type":"float" },
{ "name":"double1", "type":"double" },
{ "name":"list1", "type":{"type":"array", "items":"string"} },
{ "name":"map1", "type":{"type":"map", "values":"int"} },
{ "name":"struct1", "type":{"type":"record", "name":"struct1_name",
"fields": [
{ "name":"sInt", "type":"int" }, { "name":"sBoolean",
"type":"boolean" }, { "name":"sString", "type":"string" } ] } },
{ "name":"union1", "type":["float", "boolean", "string"] },
{ "name":"enum1", "type":{"type":"enum", "name":"enum1_values",
"symbols":["BLUE","RED", "GREEN"]} },
{ "name":"nullableint", "type":["int", "null"] },
{ "name":"bytes1", "type":"bytes" },
{ "name":"fixed1", "type":{"type":"fixed", "name":"threebytes", "size":3}
}
] }

If the table were backed by a CSV file whose rows supply values for each of the columns above, using the
delimiters specified in the ROW FORMAT clause,

one can write it out to Avro with:


CREATE TABLE as_avro
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED as INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
'avro.schema.url'='file:///path/to/the/schema/test_serializer.avsc');
insert overwrite table as_avro select * from test_serializer;

The files that are written by the Hive job are valid Avro files; however, MapReduce doesn't add the
standard .avro extension. If you copy these files out, you'll likely want to rename them with .avro.

Hive is very forgiving about types: it will attempt to store whatever value matches the provided column in
the equivalent column position in the new table. No matching is done on column names, for instance.
Therefore, it is incumbent on the query writer to make sure the target column types are correct. If they
are not, Avro may accept the type or it may throw an exception; this depends on the particular
combination of types.

Specifying the Avro schema for a table


There are three ways to provide the reader schema for an Avro table, all of which involve parameters to
the serde. As the schema evolves, one can update these values by updating the parameters in the table.

Use avro.schema.url

Specifies a url to access the schema from. For http schemas, this works for testing and small-scale
clusters, but as the schema will be accessed at least once from each task in the job, this can quickly turn
the job into a DDOS attack against the URL provider (a web server, for instance). Use caution when using
this parameter for anything other than testing.

The schema can also point to a location on HDFS, for instance: hdfs://your-nn:9000/path/to/avsc/file. The
AvroSerde will then read the file from HDFS, which should provide resiliency against many reads at once.
Note that the serde will read this file from every mapper, so it's a good idea to raise the replication factor of the
schema file to a high value to provide good locality for the readers. The schema file itself should be
relatively small, so this does not add a significant amount of overhead to the process.
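For example, the kst table from above could point at an HDFS-hosted schema instead. A sketch follows; the table name kst_hdfs is hypothetical, and the namenode host/port and schema path reuse the illustrative path from this paragraph:
CREATE TABLE kst_hdfs
  PARTITIONED BY (ds string)
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES (
    'avro.schema.url'='hdfs://your-nn:9000/path/to/avsc/kst.avsc');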

Use avro.schema.literal and embed the schema in the create statement

One can embed the schema directly into the create statement. This works if the schema doesn't have any
single quotes (or they are appropriately escaped), as Hive uses this to define the parameter value. For
instance:
CREATE TABLE embedded
COMMENT "just drop the schema right into the HQL"
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
'avro.schema.literal'='{
"namespace": "com.howdy",
"name": "some_schema",
"type": "record",
"fields": [ { "name":"string1","type":"string"}]
}');

Note that the value is enclosed in single quotes and just pasted into the create statement.

Use avro.schema.literal and pass the schema into the script

Hive can do simple variable substitution and one can pass the schema embedded in a variable to the
script. Note that to do this, the schema must be completely escaped (carriage returns converted to \n,
tabs to \t, quotes escaped, etc). An example:
set hiveconf:schema;
DROP TABLE example;
CREATE TABLE example
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
'avro.schema.literal'='${hiveconf:schema}');

To execute this script file, assuming $SCHEMA has been defined to be the escaped schema value:
hive -hiveconf schema="${SCHEMA}" -f your_script_file.sql

Note that $SCHEMA is interpolated into the quotes to correctly handle spaces within the schema.

Use none to ignore either avro.schema.literal or avro.schema.url

Hive does not provide an easy way to unset or remove a property. If you wish to switch from using the url
to the literal (or vice versa), set the to-be-ignored value to none and the AvroSerde will treat it as if it were not
set.
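For example (a sketch against the kst table defined earlier, reusing the small schema from the embedded example above purely for illustration), switching from a URL-provided schema to an inline literal could look like:
-- stop honoring the previously set URL
ALTER TABLE kst SET TBLPROPERTIES ('avro.schema.url'='none');
-- and provide the schema inline instead
ALTER TABLE kst SET TBLPROPERTIES ('avro.schema.literal'='{"namespace":"com.howdy","name":"some_schema","type":"record","fields":[{"name":"string1","type":"string"}]}');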

If something goes wrong


Hive tends to swallow exceptions from the AvroSerde that occur before job submission. To force Hive to
be more verbose, it can be started with *hive -hiveconf hive.root.logger=INFO,console*, which will spit
orders of magnitude more information to the console and will likely include any information the AvroSerde
is trying to get you about what went wrong. If the AvroSerde encounters an error during MapReduce, the
stack trace will be provided in the failed task log, which can be examined from the JobTracker's web
interface. the AvroSerde only emits the AvroSerdeException; look for these. Please include these in any
bug reports. The most common is expected to be exceptions raised while attempting to serialize a type that is
incompatible with what Avro is expecting.

FAQ
 Why do I get error-error-error-error-error-error-error and a message to check
avro.schema.literal and avro.schema.url when describing a table or running a query against a
table?

The AvroSerde returns this message when it has trouble finding or parsing the schema provided by either
the avro.schema.literal or avro.schema.url value. It is unable to be more specific because Hive
expects all calls to the serde config methods to be successful, meaning we are unable to return an actual
exception. By signaling an error via this message, the table is left in a good state and the incorrect value
can be corrected with a call to ALTER TABLE T SET TBLPROPERTIES.

Installing Hive
Installing Hive is simple and only requires having Java 1.6 and Ant installed on your machine.

Hive is available via SVN at http://svn.apache.org/repos/asf/hive/trunk. You can download it by running


the following command.
$ svn co http://svn.apache.org/repos/asf/hive/trunk hive

To build hive, execute the following command on the base directory:


$ ant package

It will create the subdirectory build/dist with the following contents:

 README.txt: readme file.


 bin/: directory containing all the shell scripts
 lib/: directory containing all required jar files
 conf/: directory with configuration files
 examples/: directory with sample input and query files

Subdirectory build/dist should contain all the files necessary to run hive. You can run it from there or copy
it to a different location, if you prefer.

In order to run Hive, you must have hadoop in your path or have defined the environment variable
HADOOP_HOME with the hadoop installation directory.

Moreover, we strongly advise users to create the HDFS directories /tmp and /user/hive/warehouse
(aka hive.metastore.warehouse.dir) and make them group-writable (chmod g+w) before tables are created in Hive.

To use the Hive command line interface (CLI), go to the Hive home directory (the one with the contents of
build/dist) and execute the following command:
$ bin/hive

Metadata is stored in an embedded Derby database whose disk storage location is determined by the
hive configuration variable named javax.jdo.option.ConnectionURL. By default (see conf/hive-
default.xml), this location is ./metastore_db

Using Derby in embedded mode allows at most one user at a time. To configure Derby to run in server
mode, look at HiveDerbyServerMode.

Configuring Hive
A number of configuration variables in Hive can be used by the administrator to change the behavior for
their installations and user sessions. These variables can be configured in any of the following ways,
shown in the order of preference:

 Using the set command in the cli for setting session level values for the configuration variable for
all statements subsequent to the set command. e.g.
 set hive.exec.scratchdir=/tmp/mydir;
sets the scratch directory (which is used by Hive to store temporary output and plans)
to /tmp/mydir for all subsequent statements.
 Using -hiveconf option on the cli for the entire session. e.g.
 bin/hive -hiveconf hive.exec.scratchdir=/tmp/mydir
 In hive-site.xml. This is used for setting values for the entire Hive configuration. e.g.
 <property>
 <name>hive.exec.scratchdir</name>
 <value>/tmp/mydir</value>
 <description>Scratch space for Hive jobs</description>
 </property>

hive-default.xml.template contains the default values for various configuration variables that
come prepackaged in a Hive distribution. In order to override any of the values, create hive-
site.xml instead and set the value in that file as shown above. Please note that this file is not used by
Hive at all (as of Hive 0.9.0) and so it might be out of date or out of sync with the actual values. The
canonical list of configuration options is now only managed in the HiveConf java class.

hive-default.xml.template is located in the conf directory in your installation root, and
hive-site.xml should also be created in the same directory.

Broadly the configuration variables are categorized into:

Hive Configuration Variables


Variable Name Description Default Value

hive.ddl.output.format The data format to use for DDL output text


(e.g. DESCRIBE table). One of "text"
(for human readable text) or "json" (for
a json object). (as of Hive 0.9.0)

hive.exec.script.wrapper Wrapper around any invocations to null


script operator e.g. if this is set to
python, the script passed to the script
operator will be invoked as python
<script command>. If the value is
null or not set, the script is invoked
as <script command>.

hive.exec.plan null

hive.exec.scratchdir This directory is used by hive to store /tmp/<user.name>/hive


the plans for different map/reduce
stages for the query as well as to stored
the intermediate outputs of these
stages.

hive.exec.submitviachild Determines whether the map/reduce false - By default jobs are submitted
jobs should be submitted through a through the same jvm as the compiler
separate jvm in the non local mode.

hive.exec.script.maxerrsize Maximum number of serialization errors 100000


allowed in a user script invoked
through TRANSFORM or MAP or REDUC
E constructs.

hive.exec.compress.output Determines whether the output of the false


final map/reduce job in a query is
compressed or not.

hive.exec.compress.interm Determines whether the output of the false


ediate intermediate map/reduce jobs in a
query is compressed or not.

hive.jar.path The location of hive_cli.jar that is used


when submitting jobs in a separate jvm.

hive.aux.jars.path The location of the plugin jars that


contain implementations of user defined
functions and serdes.

hive.partition.pruning (default: nonstrict)
    A strict value for this variable means that the compiler throws an error if no
    partition predicate is provided on a partitioned table. This is used to protect
    against a user inadvertently issuing a query against all the partitions of the
    table.

hive.map.aggr (default: true)
    Determines whether map-side aggregation is on or not.

hive.join.emit.interval (default: 1000)

hive.map.aggr.hash.percentmemory (default: (float)0.5)

hive.default.fileformat (default: TextFile)
    Default file format for CREATE TABLE statements. Options are TextFile,
    SequenceFile and RCFile.

hive.merge.mapfiles (default: true)
    Merge small files at the end of a map-only job.

hive.merge.mapredfiles (default: false)
    Merge small files at the end of a map-reduce job.

hive.merge.size.per.task (default: 256000000)
    Size of merged files at the end of the job.

hive.merge.smallfiles.avgsize (default: 16000000)
    When the average output file size of a job is less than this number, Hive will
    start an additional map-reduce job to merge the output files into bigger files.
    This is only done for map-only jobs if hive.merge.mapfiles is true, and for
    map-reduce jobs if hive.merge.mapredfiles is true.

hive.querylog.enable.plan.progress (default: true)
    Whether to log the plan's progress every time a job's progress is checked. These
    logs are written to the location specified by hive.querylog.location. (as of
    Hive 0.10)

hive.querylog.location (default: /tmp/<user.name>)
    Directory where structured Hive query logs are created. One file per session is
    created in this directory. If this variable is set to an empty string, the
    structured log will not be created.

hive.querylog.plan.progress.interval (default: 60000)
    The interval, in milliseconds, to wait between logging the plan's progress. If
    there is a whole-number percentage change in the progress of the mappers or the
    reducers, the progress is logged regardless of this value. The actual interval
    will be the ceiling of (this value divided by the value of
    hive.exec.counters.pull.interval) multiplied by the value of
    hive.exec.counters.pull.interval, i.e. if it does not divide evenly by the value
    of hive.exec.counters.pull.interval it will be logged less frequently than
    specified. This only has an effect if hive.querylog.enable.plan.progress is set
    to true. (as of Hive 0.10)

hive.stats.autogather (default: true)
    A flag to gather statistics automatically during the INSERT OVERWRITE command.
    (as of Hive 0.7.0)

hive.stats.dbclass (default: jdbc:derby)
    The default database that stores temporary Hive statistics. Valid values are
    hbase and jdbc, where jdbc should be followed by a specification of the database
    to use, separated by a colon (e.g. jdbc:mysql). (as of Hive 0.7.0)

hive.stats.dbconnectionstring (default: jdbc:derby:;databaseName=TempStatsStore;create=true)
    The default connection string for the database that stores temporary Hive
    statistics. (as of Hive 0.7.0)

hive.stats.jdbcdriver (default: org.apache.derby.jdbc.EmbeddedDriver)
    The JDBC driver for the database that stores temporary Hive statistics. (as of
    Hive 0.7.0)

hive.stats.reliable (default: false)
    Whether queries will fail because stats cannot be collected completely
    accurately. If this is set to true, reading/writing from/into a partition may
    fail because the stats could not be computed accurately. (as of Hive 0.10.0)

hive.enforce.bucketing (default: false)
    If enabled, enforces inserts into bucketed tables to also be bucketed.

hive.variable.substitute (default: true)
    Substitutes variables in Hive statements which were previously set using the set
    command, system variables or environment variables. See HIVE-1096 for details.
    (as of Hive 0.7.0)

hive.variable.substitute.depth (default: 40)
    The maximum number of replacements the substitution engine will do. (as of
    Hive 0.10.0)
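These variables can be set globally in hive-site.xml or per session with the SET command. The following is an illustrative sketch only (the page_views table and the min_clicks variable are made up) showing a session-level override and variable substitution:

    -- Enable merging of small output files for this session.
    SET hive.merge.mapfiles=true;
    SET hive.merge.smallfiles.avgsize=16000000;

    -- Variable substitution (requires hive.variable.substitute=true).
    SET min_clicks=5;
    SELECT url FROM page_views WHERE clicks > ${hiveconf:min_clicks};
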
Hive Metastore Configuration Variables
Please see the Admin Manual's section on the Metastore for details.

Hive Configuration Variables used to interact with Hadoop

hadoop.bin.path (default: $HADOOP_HOME/bin/hadoop)
    The location of the hadoop script which is used to submit jobs to hadoop when
    submitting through a separate jvm.

hadoop.config.dir (default: $HADOOP_HOME/conf)
    The location of the configuration directory of the hadoop installation.

fs.default.name (default: file:///)

map.input.file (default: null)

mapred.job.tracker (default: local)
    The URL of the jobtracker. If this is set to local then map/reduce is run in
    local mode.

mapred.reduce.tasks (default: 1)
    The number of reducers for each map/reduce stage in the query plan.

mapred.job.name (default: null)
    The name of the map/reduce job.

Hive Variables used to pass run time information

hive.session.id
    The id of the Hive session.

hive.query.string
    The query string passed to the map/reduce job.

hive.query.planid
    The id of the plan for the map/reduce stage.

hive.jobname.length (default: 50)
    The maximum length of the jobname.

hive.table.name
    The name of the Hive table. This is passed to the user scripts through the
    script operator (see the sketch after this table).

hive.partition.name
    The name of the Hive partition. This is passed to the user scripts through the
    script operator.

hive.alias
    The alias being processed. This is also passed to the user scripts through the
    script operator.
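The following is a rough sketch of how a user script might see one of these values. The table, column, and script names are made up, and the environment-variable mapping (hive.table.name exposed as hive_table_name, with '.' replaced by '_') is an assumption based on the Hadoop streaming convention:

    ADD FILE /tmp/echo_table.sh;

    SELECT TRANSFORM (url)
    USING 'echo_table.sh'
    AS (table_name STRING, url STRING)
    FROM page_views;

where echo_table.sh is:

    #!/bin/sh
    # Hypothetical script: prefixes each input row with the value of
    # hive.table.name, assuming it is exposed in the environment as
    # hive_table_name ('.' replaced by '_', as in Hadoop streaming).
    while read line; do
      printf '%s\t%s\n' "$hive_table_name" "$line"
    done
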

Temporary Folders
Hive uses temporary folders both on the machine running the Hive client and the default HDFS instance.
These folders are used to store per-query temporary/intermediate data sets and are normally cleaned up
by the hive client when the query is finished. However, in cases of abnormal hive client termination, some
data may be left behind. The configuration details are as follows:

 On the HDFS cluster this is set to /tmp/hive-<username> by default and is controlled by the
configuration variable hive.exec.scratchdir
 On the client machine, this is hardcoded to /tmp/<username>

Note that when writing data to a table/partition, Hive will first write to a temporary location on the target
table's filesystem (using hive.exec.scratchdir as the temporary location) and then move the data to the
target table. This applies in all cases - whether tables are stored in HDFS (normal case) or in file systems
like S3 or even NFS.
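For example, the HDFS scratch location can be overridden in hive-site.xml; the path below is illustrative:

    <property>
      <name>hive.exec.scratchdir</name>
      <value>/tmp/hive-scratch</value>
      <description>Scratch space for Hive intermediate data</description>
    </property>
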

Log Files
The Hive client produces logs and history files on the client machine. Please see Error Logs for
configuration details.

 Introduction
 Embedded Metastore
 Local Metastore
 Remote Metastore

Introduction
All the metadata for Hive tables and partitions is stored in the Hive Metastore. Metadata is persisted
using the JPOX ORM solution, so any store supported by JPOX can be used. Most of the commercial
relational databases and many open source datastores are supported. Any datastore that has a JDBC
driver can probably be used.

You can find an E/R diagram for the metastore here.

There are three different ways to set up the metastore server using different Hive configurations. The
relevant configuration parameters are:

javax.jdo.option.ConnectionURL
    JDBC connection string for the data store which contains metadata.

javax.jdo.option.ConnectionDriverName
    JDBC driver class name for the data store which contains metadata.

hive.metastore.uris
    Hive connects to this URI to make metadata requests for a remote metastore.

hive.metastore.local
    local or remote metastore (Removed as of Hive 0.10: if hive.metastore.uris is
    empty, local mode is assumed, remote otherwise.)

hive.metastore.warehouse.dir
    URI of the default location for native tables.

These variables were carried over from old documentation without a guarantee that they all still exist:

hive.metastore.metadb.dir

hive.metastore.usefilestore

hive.metastore.rawstore.impl

org.jpox.autoCreateSchema
    Creates the necessary schema on startup if one doesn't exist (e.g. tables,
    columns...). Set to false after creating it once.

org.jpox.fixedDatastore
    Whether the datastore schema is fixed.

hive.metastore.checkForDefaultDb

hive.metastore.ds.connection.url.hook
    Name of the hook to use for retrieving the JDO connection URL. If empty, the
    value in javax.jdo.option.ConnectionURL is used as the connection URL.

hive.metastore.ds.retry.attempts (default: 1)
    The number of times to retry a call to the backing datastore if there was a
    connection error.

hive.metastore.ds.retry.interval (default: 1000)
    The number of milliseconds between datastore retry attempts.

hive.metastore.server.min.threads (default: 200)
    Minimum number of worker threads in the Thrift server's pool.

hive.metastore.server.max.threads (default: 10000)
    Maximum number of worker threads in the Thrift server's pool.

The default configuration sets up an embedded metastore, which is used in unit tests and is described in
the next section. More practical options are described in the subsequent sections.
Embedded Metastore
The embedded metastore is mainly used for unit tests, and only one process can connect to the
metastore at a time. It is therefore not a practical production setup, but it works well for unit tests.

javax.jdo.option.ConnectionURL = jdbc:derby:;databaseName=../build/test/junit_metastore_db;create=true
    Derby database located at hive/trunk/build...

javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.EmbeddedDriver
    Derby embedded JDBC driver class.

hive.metastore.uris
    Not needed since this is a local metastore.

hive.metastore.local = true
    Embedded is local.

hive.metastore.warehouse.dir = file://${user.dir}/../build/ql/test/data/warehouse
    Unit test data goes in here on your local filesystem.

If you want to run the metastore as a network server so it can be accessed from multiple nodes, try
HiveDerbyServerMode.
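Outside of unit tests, a standalone installation with an embedded Derby metastore typically points the
connection URL at a local metastore_db directory. A minimal hive-site.xml sketch (the database path is
illustrative):

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>org.apache.derby.jdbc.EmbeddedDriver</value>
    </property>
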

Local Metastore
In the local metastore setup, each Hive client opens a connection to the datastore and makes SQL queries
against it. The following config will set up a metastore in a MySQL server. Make sure that the server is
accessible from the machines where Hive queries are executed, since this is a local store, and that the
JDBC client library is in the classpath of the Hive client.

javax.jdo.option.ConnectionURL = jdbc:mysql://<host name>/<database name>?createDatabaseIfNotExist=true
    Metadata is stored in a MySQL server.

javax.jdo.option.ConnectionDriverName = com.mysql.jdbc.Driver
    MySQL JDBC driver class.

javax.jdo.option.ConnectionUserName = <user name>
    User name for connecting to the MySQL server.

javax.jdo.option.ConnectionPassword = <password>
    Password for connecting to the MySQL server.

hive.metastore.uris
    Not needed because this is a local store.

hive.metastore.local = true
    This is a local store.

hive.metastore.warehouse.dir = <base hdfs path>
    Default location for Hive tables.
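Expressed in hive-site.xml, the above looks roughly as follows; the host name, database name, credentials
and warehouse path are placeholders to substitute for your environment:

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://metastorehost/hivemeta?createDatabaseIfNotExist=true</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hiveuser</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>hivepassword</value>
    </property>
    <property>
      <name>hive.metastore.local</name>
      <value>true</value>
    </property>
    <property>
      <name>hive.metastore.warehouse.dir</name>
      <value>/user/hive/warehouse</value>
    </property>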

Remote Metastore
In the remote metastore setup, all Hive clients make a connection to a metastore server, which in turn
queries the datastore (MySQL in this example) for metadata. The metastore server and client communicate
using the Thrift protocol. Starting with Hive 0.5.0, you can start a Thrift server by executing the following
command:
hive --service metastore

In versions of Hive earlier than 0.5.0, it's instead necessary to run the thrift server via direct execution of
Java:
$JAVA_HOME/bin/java -Xmx1024m -
Dlog4j.configuration=file://$HIVE_HOME/conf/hms-log4j.properties -
Djava.library.path=$HADOOP_HOME/lib/native/Linux-amd64-64/ -cp $CLASSPATH
org.apache.hadoop.hive.metastore.HiveMetaStore

If you execute Java directly, then JAVA_HOME, HIVE_HOME, HADOOP_HOME must be correctly set;
CLASSPATH should contain Hadoop, Hive (lib and auxlib), and Java jars.

Server Configuration Parameters

javax.jdo.option.ConnectionURL = jdbc:mysql://<host name>/<database name>?createDatabaseIfNotExist=true
    Metadata is stored in a MySQL server.

javax.jdo.option.ConnectionDriverName = com.mysql.jdbc.Driver
    MySQL JDBC driver class.

javax.jdo.option.ConnectionUserName = <user name>
    User name for connecting to the MySQL server.

javax.jdo.option.ConnectionPassword = <password>
    Password for connecting to the MySQL server.

hive.metastore.warehouse.dir = <base hdfs path>
    Default location for Hive tables.

Client Configuration Parameters

hive.metastore.uris = thrift://<host_name>:<port>
    Host and port for the Thrift metastore server.

hive.metastore.local = false
    This is not a local store; the metastore is remote.

hive.metastore.warehouse.dir = <base hdfs path>
    Default location for Hive tables.

If you are using MySQL as the datastore for metadata, put MySQL client libraries in HIVE_HOME/lib
before starting Hive Client or HiveMetastore Server.
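A sketch of the client-side hive-site.xml; the host name is a placeholder and the port should match
whatever port your metastore server listens on:

    <property>
      <name>hive.metastore.uris</name>
      <value>thrift://metastorehost:9083</value>
    </property>
    <property>
      <name>hive.metastore.local</name>
      <value>false</value>
    </property>
    <property>
      <name>hive.metastore.warehouse.dir</name>
      <value>/user/hive/warehouse</value>
    </property>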

Hive Web Interface

What is the Hive Web Interface


The Hive web interface is an alternative to using the Hive command line interface. Using the web
interface is a great way to get started with Hive.

Features

Schema Browsing
An alternative to running 'show tables' or 'show extended tables' from the CLI is to use the web based
schema browser. The Hive Meta Data is presented in a hierarchical manner allowing you to start at the
database level and click to get information about tables including the SerDe, column names, and column
types.
Detached query execution
A power user issuing multiple hive queries simultaneously would have multiple CLI windows open. The
hive web interface manages the session on the web server, not from inside the CLI window. This allows a
user to start multiple queries and return to the web interface later to check the status.

No local installation
Any user with a web browser can work with Hive. This has the usual web interface benefits. In particular,
a user wishing to interact with Hadoop or Hive otherwise requires access to many ports, whereas a remote
or VPN user would only require access to the Hive web interface, which runs by default on 0.0.0.0 tcp/9999.

Configuration
Hive Web Interface made its first appearance in the 0.2 branch. If you have 0.2 or the SVN trunk you
already have it.

You should not need to edit the defaults for the Hive web interface. HWI uses:
<property>
<name>hive.hwi.listen.host</name>
<value>0.0.0.0</value>
<description>This is the host address the Hive Web Interface will listen
on</description>
</property>

<property>
<name>hive.hwi.listen.port</name>
<value>9999</value>
<description>This is the port the Hive Web Interface will listen
on</description>
</property>

<property>
<name>hive.hwi.war.file</name>
<value>${HIVE_HOME}/lib/hive_hwi.war</value>
<description>This is the WAR file with the jsp content for Hive Web
Interface</description>
</property>

You probably want to set up HiveDerbyServerMode to allow multiple sessions at the same time.

Start up
When Hive is invoked with no arguments, the CLI starts. Hive has an extension architecture used to
start other Hive daemons.
Jetty requires Apache Ant to start HWI. You should define ANT_LIB as an environment variable or add
that to the hive invocation.
export ANT_LIB=/opt/ant/lib
bin/hive --service hwi
Java has no direct way of daemonizing. In a production environment you should create a wrapper script.
nohup bin/hive --service hwi > /dev/null 2> /dev/null &

For help on the service invocation or the list of parameters, run:
bin/hive --service hwi --help

Authentication
Hadoop currently uses environment properties to determine the user name and group vector. Thus Hive
and the Hive Web Interface cannot enforce more stringent security than Hadoop can. When you first
connect to the Hive Web Interface, you are prompted for a user name and groups. This feature was added
to support installations using different schedulers.

If you want to tighten up security, you will need to patch the Hive Session Manager source, or you
may be able to tweak the JSP to accomplish this.

Accessing
In order to access the Hive Web Interface, go to <Hive Server Address>:9999/hwi in your web browser.

Tips and tricks

Result file
The result file is local to the web server. A query that produces massive output should set the result file to
/dev/null.

Debug Mode
Debug mode is used when the user wants the result file to contain not only the result of
the Hive query but also the other messages.

Set Processor
In the CLI, a command like 'SET x=5' is not processed by the Query Processor; it is processed by
the Set Processor. Use the form 'x=5', not 'set x=5'.

Walk through
Authorize
[Screenshots: 1_hwi_authorize.png, 2_hwi_authorize.png]

Schema Browser
[Screenshots: 3_schema_table.png, 4_schema_browser.png]

Diagnostics
[Screenshot: 5_diagnostic.png]

Running a query
[Screenshots: 6_newsession.png, 7_session_runquery.png, 8_session_query_1.png, 9_file_view.png]

Setting Up Hive Server


 Setting up HiveServer2
 Setting Up Thrift Hive Server
 Setting Up Hive JDBC Server
 Setting Up Hive ODBC Server

Child Pages (1)


Page: Setting up HiveServer2

Hive and Amazon Web Services

Background
This document explores the different ways of leveraging Hive on Amazon Web Services,
namely S3, EC2 and Elastic MapReduce.

Hadoop already has a long tradition of being run on EC2 and S3. These are well documented in the links
below which are a must read:
 Hadoop and S3
 Amazon and EC2

The second document also has pointers on how to get started using EC2 and S3. For people who are
new to S3, there are a few helpful notes in the S3 for n00bs section below. The rest of the documentation
below assumes that the reader can launch a Hadoop cluster in EC2, copy files into and out of S3, and run
some simple Hadoop jobs.

Introduction to Hive and AWS


There are three separate questions to consider when running Hive on AWS:

1. Where to run the Hive CLI from and store the metastore db (that contains table and schema
definitions).
2. How to define Hive tables over existing datasets (potentially those that are already in S3)
3. How to dispatch Hive queries (which are all executed using one or more map-reduce programs)
to a Hadoop cluster running in EC2.

We walk you through the choices involved here and show some practical case studies that contain
detailed setup and configuration instructions.

Running the Hive CLI


The CLI takes in Hive queries, compiles them into a plan (commonly, but not always, consisting of map-
reduce jobs) and then submits them to a Hadoop cluster. While it depends on Hadoop libraries for this
purpose, it is otherwise relatively independent of the Hadoop cluster itself. For this reason the CLI can be
run from any node that has a Hive distribution, a Hadoop distribution, and a Java runtime. It can
submit jobs to any compatible Hadoop cluster (whose version matches that of the Hadoop libraries that
Hive is using) that it can connect to. The Hive CLI also needs access to table metadata. By default this is
persisted by Hive via an embedded Derby database into a folder named metastore_db on the local file
system (however, state can be persisted in any database, including remote MySQL instances).

There are two choices on where to run the Hive CLI from:

1. Run Hive CLI from within EC2 - the Hadoop master node being the obvious choice. There are
several problems with this approach:
 Lack of comprehensive AMIs that bundle different versions of Hive and Hadoop
distributions (and the difficulty in doing so considering the large number of such
combinations). Cloudera provides some AMIs that bundle Hive with Hadoop - although
the choice in terms of Hive and Hadoop versions may be restricted.
 Any required map-reduce scripts may also need to be copied to the master/Hive node.
 If the default Derby database is used - then one has to think about persisting state
beyond the lifetime of one hadoop cluster. S3 is an obvious choice - but the user must
restore and backup Hive metadata at the launch and termination of the Hadoop cluster.

2. Run the Hive CLI remotely from outside EC2. In this case, the user installs a Hive distribution on a
personal workstation. The main trick with this option is connecting to the Hadoop cluster, both for
submitting jobs and for reading and writing files to HDFS. The section on Running jobs from a remote
machine details how this can be done. [Case Study 1] goes into the setup for this in more detail. This
option solves the problems mentioned above:
 Stock Hadoop AMIs can be used. The user can run any version of Hive on their workstation,
launch a Hadoop cluster with the desired Hadoop version etc. on EC2 and start running queries.
 Map-reduce scripts are automatically pushed by Hive into Hadoop's distributed cache at job
submission time and do not need to be copied to the Hadoop machines.
 Hive Metadata can be stored on local disk painlessly.

However, the one downside of Option 2 is that jar files are copied over to the Hadoop cluster for each
map-reduce job. This can cause high latency in job submission as well as incur some AWS network
transmission costs. Option 1 seems suitable for advanced users who have figured out a stable Hadoop
and Hive (and potentially external libraries) configuration that works for them and can create a new AMI
with the same.

Loading Data into Hive Tables


It is useful to go over the main storage choices for the Hadoop/EC2 environment:

 S3 is an excellent place to store data for the long term. There are a couple of choices on how S3
can be used:
 Data can be stored as files within S3 using tools like aws and s3curl, as detailed
in the S3 for n00bs section below. This suffers from S3's 5G limit on file size. But
the nice thing is that there are probably scores of tools that can help in
copying/replicating data to S3 in this manner. Hadoop is able to read/write such files
using the S3N filesystem.
 Alternatively, Hadoop provides a block-based file system using S3 as a backing store.
This does not suffer from the 5G max file size restriction. However, Hadoop utilities and
libraries must be used for reading/writing such files.
 An HDFS instance on the local drives of the machines in the Hadoop cluster. The lifetime of this is
restricted to that of the Hadoop instance, hence it is not suitable for long-lived data. However it
provides data access that is much faster and hence is a good choice for
intermediate/tmp data.

Considering these factors, the following makes sense in terms of Hive tables:

1. For long-lived tables, use S3 based storage mechanisms.
2. For intermediate data and tmp tables, use HDFS.

[Case Study 1] shows you how to achieve such an arrangement using the S3N filesystem.

If the user is running Hive CLI from their personal workstation - they can also use Hive's 'load data local'
commands as a convenient alternative (to dfs commands) to copy data from their local filesystems
(accessible from their workstation) into tables defined over either HDFS or S3.
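The following HiveQL sketch illustrates this arrangement. The table names, columns, bucket and paths are
made up, and the S3N filesystem is assumed to be configured with your AWS credentials:

    -- Long-lived table over data already sitting in S3.
    CREATE EXTERNAL TABLE page_views (ts STRING, url STRING, ip STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION 's3n://my-bucket/logs/page_views/';

    -- Intermediate/tmp table on the cluster's HDFS.
    CREATE TABLE tmp_url_counts (url STRING, cnt BIGINT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    -- Copy a file from the local workstation into the HDFS-backed table.
    LOAD DATA LOCAL INPATH '/home/me/url_counts.tsv' INTO TABLE tmp_url_counts;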

Submitting jobs to a Hadoop cluster


This applies particularly when the Hive CLI is run remotely. A single Hive CLI session can switch across
different Hadoop clusters (especially as clusters are brought up and terminated). Only two configuration
variables:

 fs.default.name
 mapred.job.tracker
need to be changed to point the CLI from one Hadoop cluster to another. Beware though that
tables stored in a previous HDFS instance will not be accessible as the CLI switches from one
cluster to another. Again, more details can be found in [Case Study 1].
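For example (the host name and ports are placeholders for your cluster's master node):

    -- Point the running CLI session at a newly launched cluster.
    SET fs.default.name=hdfs://ec2-master.example.com:8020;
    SET mapred.job.tracker=ec2-master.example.com:8021;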

Case Studies
1. [Querying files in S3 using EC2, Hive and Hadoop ]

Appendix

S3 for n00bs
It is useful to understand how S3 is normally used as a file system. Each S3
bucket can be considered the root of a file system. Different files within this filesystem become
objects stored in S3, where the path name of the file (path components joined with '/') becomes
the S3 key within the bucket and the file contents become the value. Different tools like S3Fox
(https://addons.mozilla.org/en-US/firefox/addon/3247) and the native S3 FileSystem in Hadoop (s3n)
show a directory structure that's implied by the common prefixes found in the keys. Not all tools are
able to create an empty directory. In particular, S3Fox does (by creating an empty key
representing the directory). Other popular tools like aws, s3cmd and s3curl provide convenient
ways of accessing S3 from the command line but don't have the capability of creating empty
directories.

Amazon Elastic MapReduce and Hive


Amazon Elastic MapReduce is a web service that makes it easy to launch managed, resizable Hadoop
clusters on the web-scale infrastructure of Amazon Web Services (AWS). Elastic MapReduce makes it
easy for you to launch a Hive and Hadoop cluster, provides you with the flexibility to choose different
cluster sizes, and allows you to tear them down automatically when processing has completed. You pay
only for the resources that you use, with no minimums or long-term commitments.

Amazon Elastic MapReduce simplifies the use of Hive clusters by:

1. Handling the provisioning of Hadoop clusters of up to thousands of EC2 instances


2. Installing Hadoop across the master and slave nodes of your cluster and configuring Hadoop
based on your chosen hardware
3. Installing Hive on the master node of your cluster and configuring it for communication with the
Hadoop JobTracker and NameNode
4. Providing a simple API, a web UI, and purpose-built tools for managing, monitoring, and
debugging Hadoop tasks throughout the life of the cluster
5. Providing deep integration, and optimized performance, with AWS services such as S3 and EC2
and AWS features such as Spot Instances, Elastic IPs, and Identity and Access Management
(IAM)

Please refer to the following link to view the Amazon Elastic MapReduce Getting Started Guide:

http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/

Amazon Elastic MapReduce provides you with multiple clients to run your Hive cluster. You can launch a
Hive cluster using the AWS Management Console, the Amazon Elastic MapReduce Ruby Client, or the
AWS Java SDK. You may also install and run multiple versions of Hive on the same cluster, allowing you
to benchmark a newer Hive version alongside your previous version. You can also install a newer Hive
version directly onto an existing Hive cluster.

Supported versions:
Hadoop Version Hive Version

0.18 0.4

0.20 0.5, 0.7, 0.7.1

Hive Defaults

Thrift Communication port


Hive Version Thrift port

0.4 10000

0.5 10000

0.7 10001

0.7.1 10002

Log File
Hive Version Log location
0.4 /mnt/var/log/apps/hive.log

0.5 /mnt/var/log/apps/hive_05.log

0.7 /mnt/var/log/apps/hive_07.log

0.7.1 /mnt/var/log/apps/hive_07_1.log

MetaStore
By default, Amazon Elastic MapReduce uses MySQL, preinstalled on the Master Node, for its Hive
metastore. Alternatively, you can use the Amazon Relational Database Service (Amazon RDS) to ensure
the metastore is persisted beyond the life of your cluster. This also allows you to share the metastore
between multiple Hive clusters. Simply override the default location of the MySQL database to the
external persistent storage location.
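A sketch of such an override in hive-site.xml; the RDS endpoint, database name and credentials below are
placeholders:

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://mymetastore.xxxxxxxx.us-east-1.rds.amazonaws.com:3306/hivemeta?createDatabaseIfNotExist=true</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hiveuser</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>hivepassword</value>
    </property>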

Hive CLI
EMR configures the master node to allow SSH access. You can log onto the master node and execute
Hive commands using the Hive CLI. If you have multiple versions of Hive installed on the cluster you can
access each one of them via a separate command:

Hive Version Hive command

0.4 hive

0.5 hive-0.5

0.7 hive-0.7

0.7.1 hive-0.7.1

EMR sets up a separate Hive metastore and Hive warehouse for each installed Hive version on a given
cluster, so creating tables with one version does not interfere with tables created with another installed
version. Please note that if you point multiple Hive tables to the same location, updates to one table
become visible to the other tables.
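For example (the key file and master node public DNS name are placeholders; on EMR AMIs of this era
the login user is typically hadoop):

    # From your workstation: SSH to the EMR master node.
    ssh -i ~/mykey.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
    # Then, on the master node, start the CLI for a specific installed version:
    hive-0.7.1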

Hive Server
EMR runs a Thrift Hive server on the master node of the Hive cluster. It can be accessed using any JDBC
client (for example, SQuirreL SQL) via Hive JDBC drivers. The JDBC drivers for different Hive versions can
be downloaded via the following links:

Hive Version Hive JDBC

0.5 http://aws.amazon.com/developertools/0196055244487017

0.7 http://aws.amazon.com/developertools/1818074809286277

0.7.1 http://aws.amazon.com/developertools/8084613472207189

Here is the process to connect to the Hive Server using a JDBC driver:

http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_Hive.html#Hiv
eJDBCDriver

Running Batch Queries


You can also submit queries from the command line client remotely. Please note that currently there is a
limit of 256 steps on each cluster. If you have more than 256 steps to execute, it is recommended that
you run the queries directly using the Hive CLI or submit queries via a JDBC driver.

Hive S3 Tables
An Elastic MapReduce Hive cluster comes configured for communication with S3. You can create tables
and point them to your S3 location and Hive and Hadoop will communicate with S3 automatically using
your provided credentials.

Once you have moved data to an S3 bucket, you simply point your table to that location in S3 in order to
read or process data via Hive. You can also create partitioned tables in S3. Hive on Elastic MapReduce
provides support for dynamic partitioning in S3.
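A rough HiveQL sketch (the table names, columns and bucket path are made up; raw_impressions is
assumed to already exist):

    -- External partitioned table over data in S3.
    CREATE EXTERNAL TABLE impressions (ad_id STRING, clicks BIGINT)
    PARTITIONED BY (dt STRING)
    LOCATION 's3://my-bucket/tables/impressions/';

    -- Dynamic partition insert: partition values are taken from the data.
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    INSERT OVERWRITE TABLE impressions PARTITION (dt)
    SELECT ad_id, clicks, dt FROM raw_impressions;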

Hive Logs
Hive application logs: All Hive application logs are redirected to the /mnt/var/log/apps/ directory.

Hadoop daemon logs: Hadoop daemon logs are available in the /mnt/var/log/hadoop/ folder.

Hadoop task attempt logs are available in the /mnt/var/log/hadoop/userlogs/ folder on each slave node in
the cluster.

Tutorials
The following Hive tutorials are available for you to get started with Hive on Elastic MapReduce:

1. Finding trending topics using Google Books n-grams data and Apache Hive on Elastic
MapReduce
 http://aws.amazon.com/articles/Elastic-MapReduce/5249664154115844
2. Contextual Advertising using Apache Hive and Amazon Elastic MapReduce with High
Performance Computing instances
 http://aws.amazon.com/articles/Elastic-MapReduce/2855
3. Operating a Data Warehouse with Hive, Amazon Elastic MapReduce and Amazon SimpleDB
 http://aws.amazon.com/articles/Elastic-MapReduce/2854
4. Running Hive on Amazon Elastic MapReduce
 http://aws.amazon.com/articles/2857

In addition, Amazon provides step-by-step video tutorials:

 http://aws.amazon.com/articles/2862

Support
You can ask questions related to Hive on Elastic MapReduce on Elastic MapReduce forums at:

https://forums.aws.amazon.com/forum.jspa?forumID=52

Please also refer to the EMR developer guide for more information:

http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/

Contributed by: Vaibhav Aggarwal