
Hadoop Integration Guide

HP Vertica Analytic Database


Software Version: 7.1.x

Document Release Date: 7/21/2016

Legal Notices
Warranty
The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be
construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.
The information contained herein is subject to change without notice.

Restricted Rights Legend


Confidential computer software. Valid license from HP required for possession, use or copying. Consistent with FAR 12.211 and 12.212, Commercial Computer
Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor's standard commercial
license.

Copyright Notice
Copyright 2006 - 2015 Hewlett-Packard Development Company, L.P.

Trademark Notices
Adobe is a trademark of Adobe Systems Incorporated.
Microsoft and Windows are U.S. registered trademarks of Microsoft Corporation.
UNIX is a registered trademark of The Open Group.


Contents

How HP Vertica and Apache Hadoop Work Together
HP Vertica's Integrations for Apache Hadoop
Cluster Layout
Choosing Which Hadoop Interface to Use
Using the HP Vertica Connector for Hadoop MapReduce
HP Vertica Connector for Hadoop Features
Prerequisites
Hadoop and HP Vertica Cluster Scaling
Installing the Connector
Accessing HP Vertica Data From Hadoop
Selecting VerticaInputFormat
Setting the Query to Retrieve Data From HP Vertica
Using a Simple Query to Extract Data From HP Vertica
Using a Parameterized Query and Parameter Lists
Using a Discrete List of Values
Using a Collection Object
Scaling Parameter Lists for the Hadoop Cluster
Using a Query to Retrieve Parameter Values for a Parameterized Query
Writing a Map Class That Processes HP Vertica Data
Working with the VerticaRecord Class
Writing Data to HP Vertica From Hadoop
Configuring Hadoop to Output to HP Vertica
Defining the Output Table
Writing the Reduce Class
Storing Data in the VerticaRecord
Passing Parameters to the HP Vertica Connector for Hadoop Map Reduce At Run Time
Specifying the Location of the Connector .jar File
Specifying the Database Connection Parameters
Parameters for a Separate Output Database
Example HP Vertica Connector for Hadoop Map Reduce Application
Compiling and Running the Example Application
Compiling the Example (optional)
Running the Example Application
Verifying the Results
Using Hadoop Streaming with the HP Vertica Connector for Hadoop Map Reduce
Reading Data From HP Vertica in a Streaming Hadoop Job
Writing Data to HP Vertica in a Streaming Hadoop Job
Loading a Text File From HDFS into HP Vertica
Accessing HP Vertica From Pig
Registering the HP Vertica .jar Files
Reading Data From HP Vertica
Writing Data to HP Vertica
Using the HP Vertica Connector for HDFS
HP Vertica Connector for HDFS Requirements
Kerberos Authentication Requirements
Testing Your Hadoop WebHDFS Configuration
Installing the HP Vertica Connector for HDFS
Downloading and Installing the HP Vertica Connector for HDFS Package
Loading the HDFS User Defined Source
Loading Data Using the HP Vertica Connector for HDFS
The HDFS File URL
Copying Files in Parallel
Viewing Rejected Rows and Exceptions
Creating an External Table Based on HDFS Files
Load Errors in External Tables
Installing and Configuring Kerberos on Your HP Vertica Cluster
Preparing Keytab Files for the HP Vertica Connector for HDFS
HP Vertica Connector for HDFS Troubleshooting Tips
User Unable to Connect to Kerberos-Authenticated Hadoop Cluster
Resolving Error 5118
Transfer Rate Errors
Using the HCatalog Connector
Hive, HCatalog, and WebHCat Overview
HCatalog Connection Features
HCatalog Connection Considerations
Reading ORC Files Directly
Syntax
Supported Data Types
Kerberos
Query Performance
Examples
How the HCatalog Connector Works
HCatalog Connector Requirements
HP Vertica Requirements
Hadoop Requirements
Testing Connectivity
Installing the Java Runtime on Your HP Vertica Cluster
Installing a Java Runtime
Setting the JavaBinaryForUDx Configuration Parameter
Configuring HP Vertica for HCatalog
Copy Hadoop JARs and Configuration Files
Install the HCatalog Connector
Using the HCatalog Connector with HA NameNode
Defining a Schema Using the HCatalog Connector
Querying Hive Tables Using HCatalog Connector
Viewing Hive Schema and Table Metadata
Synching an HCatalog Schema With a Local Schema
Data Type Conversions from Hive to HP Vertica
Data-Width Handling Differences Between Hive and HP Vertica
Using Non-Standard SerDes
Determining Which SerDe You Need
Installing the SerDe on the HP Vertica Cluster
Troubleshooting HCatalog Connector Problems
Connection Errors
UDx Failure When Querying Data: Error 3399
SerDe Errors
Differing Results Between Hive and HP Vertica Queries
Preventing Excessive Query Delays
Using the HP Vertica Storage Location for HDFS
Storage Location for HDFS Requirements
HDFS Space Requirements
Additional Requirements for Backing Up Data Stored on HDFS
How the HDFS Storage Location Stores Data
What You Can Store on HDFS
What HDFS Storage Locations Cannot Do
Creating an HDFS Storage Location
Creating a Storage Location Using HP Vertica for SQL on Hadoop
Adding HDFS Storage Locations to New Nodes
Creating a Storage Policy for HDFS Storage Locations
Storing an Entire Table in an HDFS Storage Location
Storing Table Partitions in HDFS
Moving Partitions to a Table Stored on HDFS
Backing Up HP Vertica Storage Locations for HDFS
Configuring HP Vertica to Restore HDFS Storage Locations
Configuration Overview
Installing a Java Runtime
Finding Your Hadoop Distribution's Package Repository
Configuring HP Vertica Nodes to Access the Hadoop Distribution's Package Repository
Installing the Required Hadoop Packages
Setting Configuration Parameters
Setting Kerberos Parameters
Confirming that distcp Runs
Troubleshooting
Configuring Hadoop and HP Vertica to Enable Backup of HDFS Storage
Granting Superuser Status on Hortonworks 2.1
Granting Superuser Status on Cloudera 5.1
Manually Enabling Snapshotting for a Directory
Additional Requirements for Kerberos
Testing the Database Account's Ability to Make HDFS Directories Snapshottable
Performing Backups Containing HDFS Storage Locations
Removing HDFS Storage Locations
Removing Existing Data from an HDFS Storage Location
Moving Data to Another Storage Location
Clearing Storage Policies
Changing the Usage of HDFS Storage Locations
Dropping an HDFS Storage Location
Removing Storage Location Files from HDFS
Removing Backup Snapshots
Removing the Storage Location Directories
Troubleshooting HDFS Storage Locations
HDFS Storage Disk Consumption
Kerberos Authentication When Creating a Storage Location
Backup or Restore Fails When Using Kerberos
Integrating HP Vertica with the MapR Distribution of Hadoop
Using Kerberos with Hadoop
How Vertica Uses Kerberos With Hadoop
User Authentication
HP Vertica Authentication
See Also
Configuring Kerberos
Prerequisite: Setting Up Users and the Keytab File
HCatalog Connector
HDFS Connector
HDFS Storage Location
Token Expiration
See Also
We appreciate your feedback!


How HP Vertica and Apache Hadoop Work Together

Apache Hadoop is an open-source platform for distributed computing. Like HP Vertica, it
harnesses the power of a computer cluster to process data. Hadoop forms an ecosystem that
comprises many different components, each of which provides a different data processing
capability. The Hadoop components that HP Vertica integrates with are:

- MapReduce: A programming framework for parallel data processing.

- The Hadoop Distributed Filesystem (HDFS): A fault-tolerant data storage system that takes
  network structure into account when storing data.

- Hive: A data warehouse that provides the ability to query data stored in Hadoop.

- HCatalog: A component that makes Hive metadata available to applications outside of Hadoop.

Some HDFS configurations use Kerberos authentication. HP Vertica integrates with Kerberos to
access HDFS data if needed. See Using Kerberos with Hadoop.
While HP Vertica and Apache Hadoop can work together, HP Vertica contains many of the same
data processing features as Hadoop. For example, using HP Vertica's flex tables, you can
manipulate semi-structured data, which is a task commonly associated with Hadoop. You can also
create user-defined extensions (UDxs) using HP Vertica's SDK that perform tasks similar to
Hadoop's MapReduce jobs.

HP Vertica's Integrations for Apache Hadoop


Hewlett-Packard supplies several tools for integrating HP Vertica and Hadoop:
- The HP Vertica Connector for Apache Hadoop MapReduce lets you create Hadoop MapReduce
  jobs that retrieve data from HP Vertica. These jobs can also insert data into HP Vertica.

- The HP Vertica Connector for HDFS lets HP Vertica access data stored in HDFS.

- The HP Vertica Connector for HCatalog lets HP Vertica query data stored in a Hive database
  the same way you query data stored natively in an HP Vertica schema.

- The HP Vertica Storage Location for HDFS lets HP Vertica store data on HDFS. If you are using
  the HP Vertica for SQL on Hadoop product, this is how all of your data is stored.


Cluster Layout
In the Enterprise Edition product, your HP Vertica and Hadoop clusters must be set up on separate
nodes, ideally connected by a high-bandwidth network connection. This is different from the
configuration for HP Vertica for SQL on Hadoop, in which HP Vertica nodes are co-located on
Hadoop nodes.
The following figure illustrates the Enterprise Edition configuration:

The network is a key performance component of any well-configured cluster. When HP Vertica
stores data to HDFS it writes and reads data across the network.
The layout shown in the figure calls for two networks, and there are benefits to adding a third:
- Database Private Network: HP Vertica uses a private network for command and control and
  moving data between nodes in support of its database functions. In some networks, the
  command and control and passing of data are split across two networks.

- Database/Hadoop Shared Network: Each HP Vertica node must be able to connect to each
  Hadoop data node and the Name Node. Hadoop best practices generally require a dedicated
  network for the Hadoop cluster. This is not a technical requirement, but a dedicated network
  improves Hadoop performance. HP Vertica and Hadoop should share the dedicated Hadoop
  network.

- Optional Client Network: Outside clients may access the clustered networks through a client
  network. This is not an absolute requirement, but the use of a third network that supports client
  connections to either HP Vertica or Hadoop can improve performance. If the configuration does
  not support a client network, then client connections should use the shared network.


Choosing Which Hadoop Interface to Use


HP Vertica provides several ways to interact with data stored in Hadoop. The following
summarizes when you should consider using each interface.

Use the HP Vertica Connector for Apache Hadoop MapReduce if you want to:

- Create a MapReduce job that can read data from or write data to HP Vertica.

- Create a streaming MapReduce job that reads data from or writes data to HP Vertica.

- Create a Pig script that can read from or store data into HP Vertica.

Use the HP Vertica Connector for HDFS if you want to:

- Load structured data from the Hadoop Distributed Filesystem (HDFS) into HP Vertica.

- Create an external table based on structured data stored in HDFS.

Use the HP Vertica Connector for HCatalog if you want to:

- Query or explore data stored in Apache Hive without copying it into your database.

- Query data stored in Parquet format.

- Integrate HP Vertica and Hive metadata in a single query.

Read ORC files directly (see Reading ORC Files Directly) if you are using the HP Vertica for
SQL on Hadoop product and want to:

- Query data stored in Optimized Row Columnar (ORC) files in HDFS (faster than the HCatalog
  Connector, but does not benefit from Hive data).

Use the HP Vertica Storage Location for HDFS if you want to:

- Store lower-priority data in HDFS while your higher-priority data is stored locally (Enterprise
  Edition product).

- Store data in the database native format (ROS files) for improved query execution compared to
  Hive (HP Vertica for SQL on Hadoop product).

If you are using the HP Vertica for SQL on Hadoop product, which is licensed for HDFS data only,
there are some additional considerations. See Choosing How to Connect HP Vertica To HDFS.


Using the HP Vertica Connector for Hadoop MapReduce

The HP Vertica Connector for Hadoop MapReduce lets you create Hadoop MapReduce jobs that
can read data from and write data to HP Vertica. You commonly use it when:
- You need to incorporate data from HP Vertica into your MapReduce job. For example, suppose
  you are using Hadoop's MapReduce to process web server logs. You may want to access
  sentiment analysis data stored in HP Vertica using Pulse to try to correlate a website visitor with
  social media activity.

- You are using Hadoop MapReduce to refine data on which you want to perform analytics. You
  can have your MapReduce job directly insert data into HP Vertica where you can analyze it in
  real time using all of HP Vertica's features.

HP Vertica Connector for Hadoop Features


The HP Vertica Connector for Hadoop Map Reduce:
- gives Hadoop access to data stored in HP Vertica.

- lets Hadoop store its results in HP Vertica. The Connector can create a table for the Hadoop
  data if it does not already exist.

- lets applications written in Apache Pig access and store data in HP Vertica.

- works with Hadoop streaming.

The Connector runs on each node in the Hadoop cluster, so the Hadoop nodes and HP Vertica
nodes communicate with each other directly. Direct connections allow data to be transferred in
parallel, dramatically increasing processing speed.
The Connector is written in Java, and is compatible with all platforms supported by Hadoop.
Note: To prevent Hadoop from potentially inserting multiple copies of data into HP Vertica, the
HP Vertica Connector for Hadoop Map Reduce disables Hadoop's speculative execution
feature.


Prerequisites
Before you can use the HP Vertica Connector for Hadoop MapReduce, you must install and
configure Hadoop and be familiar with developing Hadoop applications. For details on installing and
using Hadoop, please see the Apache Hadoop Web site.
See HP Vertica 7.1.x Supported Platforms for a list of the versions of Hadoop and Pig that the
connector supports.

Hadoop and HP Vertica Cluster Scaling


When using the connector for MapReduce, nodes in the Hadoop cluster connect directly to HP
Vertica nodes when retrieving or storing data. These direct connections allow the two clusters to
transfer large volumes of data in parallel. If the Hadoop cluster is larger than the HP Vertica cluster,
this parallel data transfer can negatively impact the performance of the HP Vertica database.
To avoid performance impacts on your HP Vertica database, ensure that your Hadoop cluster
cannot overwhelm your HP Vertica cluster. The exact sizing of each cluster depends on how fast
your Hadoop cluster generates data requests and the load placed on the HP Vertica database by
queries from other sources. A good rule of thumb to follow is for your Hadoop cluster to be no larger
than your HP Vertica cluster.

Installing the Connector


Follow these steps to install the HP Vertica Connector for Hadoop Map Reduce:
If you have not already done so, download the HP Vertica Connector for Hadoop Map Reduce
installation package from the myVertica portal. Be sure to download the package that is compatible
with your version of Hadoop. You can find your Hadoop version by issuing the following command
on a Hadoop node:
# hadoop version

You will also need a copy of the HP Vertica JDBC driver which you can also download from the
myVertica portal.
You need to perform the following steps on each node in your Hadoop cluster:
1. Copy the HP Vertica Connector for Hadoop Map Reduce .zip archive you downloaded to a
temporary location on the Hadoop node.


2. Copy the HP Vertica JDBC driver .jar file to the same location on your node. If you haven't
already, you can download this driver from the myVertica portal.
3. Unzip the connector .zip archive into a temporary directory. On Linux, you usually use the
command unzip.
4. Locate the Hadoop home directory (the directory where Hadoop is installed). The location of
this directory depends on how you installed Hadoop (manual install versus a package supplied
by your Linux distribution or Cloudera). If you do not know the location of this directory, you can
try the following steps:
   - See if the HADOOP_HOME environment variable is set by issuing the command echo
     $HADOOP_HOME on the command line.

   - See if Hadoop is in your path by typing hadoop classpath on the command line. If it is,
     this command lists the paths of all the jar files used by Hadoop, which should tell you the
     location of the Hadoop home directory.

   - If you installed using a .deb or .rpm package, you can look in /usr/lib/hadoop, as this is
     often the location where these packages install Hadoop.

5. Copy the file hadoop-vertica.jar from the directory where you unzipped the connector
archive to the lib subdirectory in the Hadoop home directory.
6. Copy the HP Vertica JDBC driver file (vertica-jdbc-x.x.x.jar) to the lib subdirectory in
the Hadoop home directory ($HADOOP_HOME/lib).
7. Edit the $HADOOP_HOME/conf/hadoop-env.sh file, and find the lines:
# Extra Java CLASSPATH elements. Optional.
# export HADOOP_CLASSPATH=

Uncomment the export line by removing the hash character (#) and add the absolute path of
the JDBC driver file you copied in the previous step. For example:
export HADOOP_CLASSPATH=$HADOOP_HOME/lib/vertica-jdbc-x.x.x.jar

This environment variable ensures that Hadoop can find the HP Vertica JDBC driver.
8. Also in the $HADOOP_HOME/conf/hadoop-env.sh file, ensure that the JAVA_HOME environment
variable is set to your Java installation.


9. If you want your application written in Pig to be able to access HP Vertica, you need to:
a. Locate the Pig home directory. Often, this directory is in the same parent directory as the
Hadoop home directory.
b. Copy the file named pig-vertica.jar from the directory where you unpacked the
connector .zip file to the lib subdirectory in the Pig home directory.
c. Copy the HP Vertica JDBC driver file (vertica-jdbc-x.x.x.jar) to the lib subdirectory
in the Pig home directory.


Accessing HP Vertica Data From Hadoop


You need to follow three steps to have Hadoop fetch data from HP Vertica:
- Set the Hadoop job's input format to be VerticaInputFormat.

- Give the VerticaInputFormat class a query to be used to extract data from HP Vertica.

- Create a Mapper class that accepts VerticaRecord objects as input.

The following sections explain each of these steps in greater detail.

Selecting VerticaInputFormat
The first step to reading HP Vertica data from within a Hadoop job is to set its input format. You
usually set the input format within the run() method in your Hadoop application's class. To set up
the input format, pass the job.setInputFormatClass method the VerticaInputFormat.class, as
follows:
public int run(String[] args) throws Exception {
// Set up the configuration and job objects
Configuration conf = getConf();
Job job = new Job(conf);

(later in the code)


// Set the input format to retrieve data from
// Vertica.
job.setInputFormatClass(VerticaInputFormat.class);

Setting the input to the VerticaInputFormat class means that the map method will get
VerticaRecord objects as its input.


Setting the Query to Retrieve Data From HP Vertica


A Hadoop job that reads data from your HP Vertica database has to execute a query that selects its
input data. You pass this query to your Hadoop application using the setInput method of the
VerticaInputFormat class. The HP Vertica Connector for Hadoop Map Reduce sends this query
to the Hadoop nodes which then individually connect to HP Vertica nodes to run the query and get
their input data.
A primary consideration for this query is how it segments the data being retrieved from HP Vertica.
Since each node in the Hadoop cluster needs data to process, the query result needs to be
segmented between the nodes.
There are three formats you can use for the query you want your Hadoop job to use when retrieving
input data. Each format determines how the query's results are split across the Hadoop cluster.
These formats are:
- A simple, self-contained query.

- A parameterized query along with explicit parameters.

- A parameterized query along with a second query that retrieves the parameter values for the first
  query from HP Vertica.

The following sections explain each of these methods in detail.

Using a Simple Query to Extract Data From HP Vertica


The simplest format for the query that Hadoop uses to extract data from HP Vertica is a self-contained, hard-coded query. You pass this query in a String to the setInput method of the
VerticaInputFormat class. You usually make this call in the run method of your Hadoop job
class. For example, the following code retrieves the entire contents of the table named allTypes.
// Sets the query to use to get the data from the Vertica database.
// Simple query with no parameters
VerticaInputFormat.setInput(job,
"SELECT * FROM allTypes ORDER BY key;");

The query you supply must have an ORDER BY clause, since the HP Vertica Connector for
Hadoop Map Reduce uses it to figure out how to segment the query results between the Hadoop
nodes. When it gets a simple query, the connector calculates limits and offsets to be sent to each
node in the Hadoop cluster, so they each retrieve a portion of the query results to process.


Having Hadoop use a simple query to retrieve data from HP Vertica is the least efficient method,
since the connector needs to perform extra processing to determine how the data should be
segmented across the Hadoop nodes.

Using a Parameterized Query and Parameter Lists


You can have Hadoop retrieve data from HP Vertica using a parametrized query, to which you
supply a set of parameters. The parameters in the query are represented by a question mark (?).
You pass the query and the parameters to the setInput method of the VerticaInputFormat class.
You have two options for passing the parameters: using a discrete list, or by using a Collection
object.

Using a Discrete List of Values


To pass a discrete list of parameters for the query, you include them in the setInput method call in
a comma-separated list of string values, as shown in the next example:
// Simple query with supplied parameters
VerticaInputFormat.setInput(job,
"SELECT * FROM allTypes WHERE key = ?", "1001", "1002", "1003");

The HP Vertica Connector for Hadoop Map Reduce tries to evenly distribute the query and
parameters among the nodes in the Hadoop cluster. If the number of parameters is not a multiple of
the number of nodes in the cluster, some nodes will get more parameters to process than others.
Once the connector divides up the parameters among the Hadoop nodes, each node connects to a
host in the HP Vertica database and executes the query, substituting in the parameter values it
received.
This format is useful when you have a discrete set of parameters that will not change over time.
However, it is inflexible because any changes to the parameter list requires you to recompile your
Hadoop job. An added limitation is that the query can contain just a single parameter, because the
setInput method only accepts a single parameter list. The more flexible way to use parameterized
queries is to use a collection to contain the parameters.

Using a Collection Object


The more flexible method of supplying the parameters for the query is to store them into a
Collection object, then include the object in the setInput method call. This method allows you to
build the list of parameters at run time, rather than having them hard-coded. You can also use
multiple parameters in the query, since you will pass a collection of ArrayList objects to the
setInput method. Each ArrayList object supplies one set of parameter values, and can contain values
for each parameter in the query.
The following example demonstrates using a collection to pass the parameter values for a query
containing two parameters. The collection object passed to setInput is an instance of the HashSet
class. This object contains four ArrayList objects added within the for loop. This example just
adds dummy values (the loop counter and the string "FOUR"). In your own application, you usually
calculate parameter values in some manner before adding them to the collection.
Note: If your parameter values are stored in HP Vertica, you should specify the parameters
using a query instead of a collection. See Using a Query to Retrieve Parameter Values for a
Parameterized Query for details.

// Collection to hold all of the sets of parameters for the query.


Collection<List<Object>> params = new HashSet<List<Object>>() {
};
// Each set of parameters lives in an ArrayList. Each entry
// in the list supplies a value for a single parameter in
// the query. Here, ArrayList objects are created in a loop
// that adds the loop counter and a static string as the
// parameters. The ArrayList is then added to the collection.
for (int i = 0; i < 4; i++) {
ArrayList<Object> param = new ArrayList<Object>();
param.add(i);
param.add("FOUR");
params.add(param);
}
VerticaInputFormat.setInput(job,
"select * from allTypes where key = ? AND NOT varcharcol = ?",
params);

Scaling Parameter Lists for the Hadoop Cluster


Whenever possible, make the number of parameter values you pass to the HP Vertica Connector
for Hadoop Map Reduce equal to the number of nodes in the Hadoop cluster because each
parameter value is assigned to a single Hadoop node. This ensures that the workload is spread
across the entire Hadoop cluster. If you supply fewer parameter values than there are nodes in the
Hadoop cluster, some of the nodes will not get a value and will sit idle. If the number of parameter
values is not a multiple of the number of nodes in the cluster, Hadoop randomly assigns the extra
values to nodes in the cluster. It does not perform scheduling: it does not wait for a node to finish its
task and become free before assigning additional tasks. In this case, a node could become a
bottleneck if it is assigned the longer-running portions of the job.


In addition to the number of parameters in the query, you should make the parameter values yield
roughly the same number of results. Ensuring each parameter yields the same number of results
helps prevent a single node in the Hadoop cluster from becoming a bottleneck by having to process
more data than the other nodes in the cluster.
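
As a rough illustration, the following sketch (not part of the shipped example; numHadoopNodes is a
hypothetical value that you would determine from your own cluster) builds exactly one parameter list
per Hadoop node so that no node sits idle:

// Hypothetical sketch: size the parameter collection to the Hadoop cluster.
// numHadoopNodes is an assumed value; substitute the real size of your cluster.
int numHadoopNodes = 4;
Collection<List<Object>> params = new HashSet<List<Object>>();
for (int i = 0; i < numHadoopNodes; i++) {
    ArrayList<Object> param = new ArrayList<Object>();
    param.add(i); // one value for the query's single ? parameter
    params.add(param);
}
VerticaInputFormat.setInput(job,
    "select * from allTypes where key = ?", params);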

Using a Query to Retrieve Parameter Values for a Parameterized Query

You can pass the HP Vertica Connector for Hadoop Map Reduce a query to extract the parameter
values for a parameterized query. This query must return a single column of data that is used as
parameters for the parameterized query.
To use a query to retrieve the parameter values, supply the VerticaInputFormat class's setInput
method with the parameterized query and a query to retrieve parameters. For example:
// Sets the query to use to get the data from the Vertica database.
// Query using a parameter that is supplied by another query
VerticaInputFormat.setInput(job,
"select * from allTypes where key = ?",
"select distinct key from regions");

When it receives a query for parameters, the connector runs the query itself, then groups the results
together to send out to the Hadoop nodes, along with the parameterized query. The Hadoop nodes
then run the parameterized query using the set of parameter values sent to them by the connector.

Writing a Map Class That Processes HP Vertica Data


Once you have set up your Hadoop application to read data from HP Vertica, you need to create a
Map class that actually processes the data. Your Map class's map method receives LongWritable
values as keys and VerticaRecord objects as values. The key values are just sequential numbers
that identify the row in the query results. The VerticaRecord class represents a single row from the
result set returned by the query you supplied to the VerticaInput.setInput method.

Working with the VerticaRecord Class


Your map method extracts the data it needs from the VerticaRecord class. This class contains
three main methods you use to extract data from the record set:
- get retrieves a single value, either by index value or by name, from the row sent to the map
  method.

- getOrdinalPosition takes a string containing a column name and returns the column's number.

- getType returns the data type of a column in the row specified by index value or by name. This
  method is useful if you are unsure of the data types of the columns returned by the query. The
  types are stored as integer values defined by the java.sql.Types class.
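
As a brief sketch (based on the method descriptions above rather than on the shipped example; it
assumes the query results contain a BIGINT column named key), a map method can look up a column by
name and check its reported type before casting:

// Sketch only: look up the column position by name, then check the reported
// java.sql.Types value before casting the object returned by get.
public void map(LongWritable key, VerticaRecord value, Context context)
    throws IOException, InterruptedException {
    int keyCol = value.getOrdinalPosition("key");
    if (value.getType(keyCol) == java.sql.Types.BIGINT) {
        Long keyValue = (Long) value.get(keyCol);
        context.write(new Text("key"), new DoubleWritable(keyValue));
    }
}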
The following example shows a Mapper class and map method that accepts VerticaRecord
objects. In this example, no real work is done. Instead two values are selected as the key and value
to be passed on to the reducer.
public static class Map extends
Mapper<LongWritable, VerticaRecord, Text, DoubleWritable> {
// This mapper accepts VerticaRecords as input.
public void map(LongWritable key, VerticaRecord value, Context context)
throws IOException, InterruptedException {
// In your mapper, you would do actual processing here.
// This simple example just extracts two values from the row of
// data and passes them to the reducer as the key and value.
if (value.get(3) != null && value.get(0) != null) {
context.write(new Text((String) value.get(3)),
new DoubleWritable((Long) value.get(0)));
}
}
}

If your Hadoop job has a reduce stage, all of the map method output is managed by Hadoop. It is not
stored or manipulated in any way by HP Vertica. If your Hadoop job does not have a reduce stage,
and needs to store its output into HP Vertica, your map method must output its keys as Text objects
and values as VerticaRecord objects.
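
For a map-only job of this kind, the job setup might look like the following sketch (an
assumption-based outline, not from the shipped example; setNumReduceTasks is standard Hadoop API,
the table name is a placeholder, and the output table is defined as described in Defining the Output
Table below):

// Sketch of a map-only job: with zero reducers, the mapper's Text keys and
// VerticaRecord values go straight to the connector's output format.
job.setNumReduceTasks(0);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(VerticaRecord.class);
job.setOutputFormatClass(VerticaOutputFormat.class);
// "maponlytarget" is a placeholder table name; define it as described in
// Defining the Output Table.
VerticaOutputFormat.setOutput(job, "maponlytarget", true, "a int", "v varchar");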


Writing Data to HP Vertica From Hadoop


There are three steps you need to take for your Hadoop application to store data in HP Vertica:
- Set the output value class of your Hadoop job to VerticaRecord.

- Set the details of the HP Vertica table where you want to store your data in the
  VerticaOutputFormat class.

- Create a Reduce class that adds data to a VerticaRecord object and calls its write method to
  store the data.

The following sections explain these steps in more detail.

Configuring Hadoop to Output to HP Vertica


To tell your Hadoop application to output data to HP Vertica you configure your Hadoop application
to output to the HP Vertica Connector for Hadoop Map Reduce. You will normally perform these
steps in your Hadoop application's run method. There are three methods that need to be called in
order to set up the output to be sent to the connector and to set the output of the Reduce class, as
shown in the following example:
// Set the output format of Reduce class. It will
// output VerticaRecords that will be stored in the
// database.
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(VerticaRecord.class);
// Tell Hadoop to send its output to the Vertica
// HP Vertica Connector for Hadoop Map Reduce.
job.setOutputFormatClass(VerticaOutputFormat.class);

The call to setOutputValueClass tells Hadoop that the output of the Reduce.reduce method is a
VerticaRecord class object. This object represents a single row of an HP Vertica database table.
You tell your Hadoop job to send the data to the connector by setting the output format class to
VerticaOutputFormat.

Defining the Output Table


Call the VerticaOutputFormat.setOutput method to define the table that will hold the Hadoop
application output:
VerticaOutputFormat.setOutput(jobObject, tableName, [truncate, ["columnName1 dataType1"
[,"columnNamen dataTypen" ...]] );

jobObject
    The Hadoop job object for your application.

tableName
    The name of the table to store Hadoop's output. If this table does not exist, the HP Vertica
    Connector for Hadoop Map Reduce automatically creates it. The name can be a full
    database.schema.table reference.

truncate
    A Boolean controlling whether to delete the contents of tableName if it already exists. If set
    to true, any existing data in the table is deleted before Hadoop's output is stored. If set to
    false or not given, the Hadoop output is added to the existing data in the table.

"columnName1 dataType1"
    The table column definitions, where columnName1 is the column name and dataType1 the SQL data
    type. These two values are separated by a space. If not specified, the existing table is used.

The first two parameters are required. You can add as many column definitions as you need in your
output table.
You usually call the setOutput method in your Hadoop class's run method, where all other setup
work is done. The following example sets up an output table named mrtarget that contains 8
columns, each containing a different data type:
// Sets the output format for storing data in Vertica. It defines the
// table where data is stored, and the columns it will be stored in.
VerticaOutputFormat.setOutput(job, "mrtarget", true, "a int",
"b boolean", "c char(1)", "d date", "f float", "t timestamp",
"v varchar", "z varbinary");

If the truncate parameter is set to true for the method call, and the target table already exists in the HP
Vertica database, the connector deletes the table contents before storing the Hadoop job's output.
Note: If the table already exists in the database, and the method call truncate parameter is set
to false, the HP Vertica Connector for Hadoop Map Reduce adds new application output to
the existing table. However, the connector does not verify that the column definitions in the
existing table match those defined in the setOutput method call. If the new application output
values cannot be converted to the existing column values, your Hadoop application can throw
casting exceptions.


Writing the Reduce Class


Once your Hadoop application is configured to output data to HP Vertica and has its output table
defined, you need to create the Reduce class that actually formats and writes the data for storage in
HP Vertica.
The first step your Reduce class should take is to instantiate a VerticaRecord object to hold the
output of the reduce method. This is a little more complex than just instantiating a base object,
since the VerticaRecord object must have the columns defined in it that match the output table's
columns (see Defining the Output Table for details). To get the properly configured VerticaRecord
object, you pass the constructor the configuration object.
You usually instantiate the VerticaRecord object in your Reduce class's setup method, which
Hadoop calls before it calls the reduce method. For example:
// Sets up the output record that will be populated by
// the reducer and eventually written out.
public void setup(Context context) throws IOException,
InterruptedException {
super.setup(context);
try {
// Instantiate a VerticaRecord object that has the proper
// column definitions. The object is stored in the record
// field for use later.
record = new VerticaRecord(context.getConfiguration());
} catch (Exception e) {
throw new IOException(e);
}
}

Storing Data in the VerticaRecord


Your reduce method starts the same way any other Hadoop reduce method does: it processes its
input key and value, performing whatever reduction task your application needs. Afterwards, your
reduce method adds the data to be stored in HP Vertica to the VerticaRecord object that was
instantiated earlier. Usually you use the set method to add the data:
VerticaRecord.set(column, value);
column
    The column to store the value in. This is either an integer (the column number) or a String
    (the column name, as defined in the table definition).

    Note: The set method throws an exception if you pass it the name of a column that does not
    exist. You should always use a try/catch block around any set method call that uses a column
    name.

value
    The value to store in the column. The data type of this value must match the definition of the
    column in the table.

Note: If you do not have the set method validate that the data types of the value and the
column match, the HP Vertica Connector for Hadoop Map Reduce throws a
ClassCastException if it finds a mismatch when it tries to commit the data to the database.
This exception causes a rollback of the entire result. By having the set method validate the
data type of the value, you can catch and resolve the exception before it causes a rollback.
In addition to the set method, you can also use the setFromString method to have the HP Vertica
Connector for Hadoop Map Reduce convert the value from String to the proper data type for the
column:
VerticaRecord.setFromString(column, "value");
column
    The column number to store the value in, as an integer.

value
    A String containing the value to store in the column. If the String cannot be converted to
    the correct data type to be stored in the column, setFromString throws an exception
    (ParseException for date values, NumberFormatException for numeric values).

Your reduce method must output a value for every column in the output table. If you want a column
to have a null value you must explicitly set it.
After it populates the VerticaRecord object, your reduce method calls the Context.write
method, passing it the name of the table to store the data in as the key, and the VerticaRecord
object as the value.
The following example shows a simple Reduce class that stores data into HP Vertica. To make the
example as simple as possible, the code doesn't actually process the input it receives, and instead
just writes dummy data to the database. In your own application, you would process the key and
values into data that you then store in the VerticaRecord object.
public static class Reduce extends
Reducer<Text, DoubleWritable, Text, VerticaRecord> {
// Holds the records that the reducer writes its values to.
VerticaRecord record = null;

// Sets up the output record that will be populated by


// the reducer and eventually written out.
public void setup(Context context) throws IOException,
InterruptedException {
super.setup(context);
try {


// Need to call VerticaOutputFormat to get a record object


// that has the proper column definitions.
record = new VerticaRecord(context.getConfiguration());
} catch (Exception e) {
throw new IOException(e);
}
}

// The reduce method.


public void reduce(Text key, Iterable<DoubleWritable> values,
Context context) throws IOException, InterruptedException {
// Ensure that the record object has been set up properly. This is
// where the results are written.
if (record == null) {
throw new IOException("No output record found");
}
// In this part of your application, your reducer would process the
// key and values parameters to arrive at values that you want to
// store into the database. For simplicity's sake, this example
// skips all of the processing, and just inserts arbitrary values
// into the database.
//
// Use the .set method to store a value in the record to be stored
// in the database. The first parameter is the column number,
// the second is the value to store.
//
// Column numbers start at 0.
//
// Set record 0 to an integer value, you
// should always use a try/catch block to catch the exception.

try {
record.set(0, 125);
} catch (Exception e) {
// Handle the improper data type here.
e.printStackTrace();
}
// You can also set column values by name rather than by column
// number. However, this requires a try/catch since specifying a
// non-existent column name will throw an exception.
try {
// The second column, named "b", contains a Boolean value.
record.set("b", true);
} catch (Exception e) {
// Handle an improper column name here.
e.printStackTrace();
}
// Column 2 stores a single char value.
record.set(2, 'c');

// Column 3 is a date. Value must be a java.sql.Date type.


record.set(3, new java.sql.Date(


Calendar.getInstance().getTimeInMillis()));

// You can use the setFromString method to convert a string


// value into the proper data type to be stored in the column.
// You need to use a try...catch block in this case, since the
// string to value conversion could fail (for example, trying to
// store "Hello, World!" in a float column is not going to work).
try {
record.setFromString(4, "234.567");
} catch (ParseException e) {
// Thrown if the string cannot be parsed into the data type
// to be stored in the column.
e.printStackTrace();
}
// Column 5 stores a timestamp
record.set(5, new java.sql.Timestamp(
Calendar.getInstance().getTimeInMillis()));
// Column 6 stores a varchar
record.set(6, "example string");
// Column 7 stores a varbinary
record.set(7, new byte[10]);
// Once the columns are populated, write the record to store
// the row into the database.
context.write(new Text("mrtarget"), record);
}
}

Passing Parameters to the HP Vertica Connector for Hadoop Map Reduce At Run Time

Specifying the Location of the Connector .jar File
Recent versions of Hadoop fail to find the HP Vertica Connector for Hadoop Map Reduce classes
automatically, even though they are included in the Hadoop lib directory. Therefore, you need to
manually tell Hadoop where to find the connector .jar file using the libjars argument:
hadoop jar myHadoopApp.jar com.myorg.hadoop.myHadoopApp \
-libjars $HADOOP_HOME/lib/hadoop-vertica.jar \
. . .

Specifying the Database Connection Parameters


You need to pass connection parameters to the HP Vertica Connector for Hadoop Map Reduce
when starting your Hadoop application, so it knows how to connect to your database. At a
minimum, these parameters must include the list of hostnames in the HP Vertica database cluster,


the name of the database, and the user name. The common parameters for accessing the database
appear in the following table. Usually, you will only need the basic parameters listed in this table in
order to start your Hadoop application.
mapred.vertica.hostnames (required; no default)
    A comma-separated list of the names or IP addresses of the hosts in the HP Vertica cluster.
    You should list all of the nodes in the cluster here, since individual nodes in the Hadoop
    cluster connect directly with a randomly assigned host in the cluster. The hosts in this
    cluster are used for both reading from and writing data to the HP Vertica database, unless
    you specify a different output database (see below).

mapred.vertica.port (optional; default 5433)
    The port number for the HP Vertica database.

mapred.vertica.database (required)
    The name of the database the Hadoop application should access.

mapred.vertica.username (required)
    The username to use when connecting to the database.

mapred.vertica.password (optional; default empty)
    The password to use when connecting to the database.
You pass the parameters to the connector using the -D command line switch in the command you
use to start your Hadoop application. For example:
hadoop jar myHadoopApp.jar com.myorg.hadoop.myHadoopApp \
-libjars $HADOOP_HOME/lib/hadoop-vertica.jar \
-Dmapred.vertica.hostnames=Vertica01,Vertica02,Vertica03,Vertica04 \
-Dmapred.vertica.port=5433 -Dmapred.vertica.username=exampleuser \
-Dmapred.vertica.password=password123 -Dmapred.vertica.database=ExampleDB

Parameters for a Separate Output Database


The parameters in the previous table are all you need if your Hadoop application accesses a single
HP Vertica database. You can also have your Hadoop application read from one HP Vertica
database and write to a different HP Vertica database. In this case, the parameters shown in the
previous table apply to the input database (the one Hadoop reads data from). The following table

HP Vertica Analytic Database (7.1.x)

Page 31 of 135

Hadoop Integration Guide


Using the HP Vertica Connector for Hadoop MapReduce

lists the parameters that you use to supply your Hadoop application with the connection information
for the output database (the one it writes its data to). None of these parameters is required. If you do
not assign a value to one of these output parameters, it inherits its value from the input database
parameters.
mapred.vertica.hostnames.output (default: the input hostnames)
    A comma-separated list of the names or IP addresses of the hosts in the output HP Vertica
    cluster.

mapred.vertica.port.output (default: 5433)
    The port number for the output HP Vertica database.

mapred.vertica.database.output (default: the input database name)
    The name of the output database.

mapred.vertica.username.output (default: the input database user name)
    The username to use when connecting to the output database.

mapred.vertica.password.output (default: the input database password)
    The password to use when connecting to the output database.
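
For example, a hypothetical invocation (the host names, database names, and user names below are
placeholders) that reads from one HP Vertica database and writes to another could combine the input
parameters with the .output parameters listed above:

hadoop jar myHadoopApp.jar com.myorg.hadoop.myHadoopApp \
-libjars $HADOOP_HOME/lib/hadoop-vertica.jar \
-Dmapred.vertica.hostnames=InputVertica01,InputVertica02 \
-Dmapred.vertica.database=InputDB -Dmapred.vertica.username=exampleuser \
-Dmapred.vertica.hostnames.output=OutputVertica01,OutputVertica02 \
-Dmapred.vertica.database.output=OutputDB -Dmapred.vertica.username.output=outputuser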

Example HP Vertica Connector for Hadoop Map Reduce Application

This section presents an example of using the HP Vertica Connector for Hadoop Map Reduce to
retrieve and store data from an HP Vertica database. The example pulls together the code that has
appeared on the previous topics to present a functioning example.
This application reads data from a table named allTypes. The mapper selects two values from this
table to send to the reducer. The reducer doesn't perform any operations on the input, and instead
inserts arbitrary data into the output table named mrtarget.
package com.vertica.hadoop;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.Collection;
import java.util.HashSet;


import java.util.Iterator;
import java.util.List;
import java.math.BigDecimal;
import java.sql.Date;
import java.sql.Timestamp;
// Needed when using the setFromString method, which throws this exception.
import java.text.ParseException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import com.vertica.hadoop.VerticaConfiguration;
import com.vertica.hadoop.VerticaInputFormat;
import com.vertica.hadoop.VerticaOutputFormat;
import com.vertica.hadoop.VerticaRecord;
// This is the class that contains the entire Hadoop example.
public class VerticaExample extends Configured implements Tool {
public static class Map extends
Mapper<LongWritable, VerticaRecord, Text, DoubleWritable> {
// This mapper accepts VerticaRecords as input.
public void map(LongWritable key, VerticaRecord value, Context context)
throws IOException, InterruptedException {
// In your mapper, you would do actual processing here.
// This simple example just extracts two values from the row of
// data and passes them to the reducer as the key and value.
if (value.get(3) != null && value.get(0) != null) {
context.write(new Text((String) value.get(3)),
new DoubleWritable((Long) value.get(0)));
}
}
}

public static class Reduce extends


Reducer<Text, DoubleWritable, Text, VerticaRecord> {
// Holds the records that the reducer writes its values to.
VerticaRecord record = null;

// Sets up the output record that will be populated by


// the reducer and eventually written out.
public void setup(Context context) throws IOException,
InterruptedException {
super.setup(context);
try {
// Need to call VerticaOutputFormat to get a record object
// that has the proper column definitions.


record = new VerticaRecord(context.getConfiguration());


} catch (Exception e) {
throw new IOException(e);
}
}

// The reduce method.


public void reduce(Text key, Iterable<DoubleWritable> values,
Context context) throws IOException, InterruptedException {
// Ensure that the record object has been set up properly. This is
// where the results are written.
if (record == null) {
throw new IOException("No output record found");
}
// In this part of your application, your reducer would process the
// key and values parameters to arrive at values that you want to
// store into the database. For simplicity's sake, this example
// skips all of the processing, and just inserts arbitrary values
// into the database.
//
// Use the .set method to store a value in the record to be stored
// in the database. The first parameter is the column number,
// the second is the value to store.
//
// Column numbers start at 0.
//
// Set record 0 to an integer value, you
// should always use a try/catch block to catch the exception.

try {
record.set(0, 125);
} catch (Exception e) {
// Handle the improper data type here.
e.printStackTrace();
}
// You can also set column values by name rather than by column
// number. However, this requires a try/catch since specifying a
// non-existent column name will throw an exception.
try {
// The second column, named "b", contains a Boolean value.
record.set("b", true);
} catch (Exception e) {
// Handle an improper column name here.
e.printStackTrace();
}
// Column 2 stores a single char value.
record.set(2, 'c');

// Column 3 is a date. Value must be a java.sql.Date type.


record.set(3, new java.sql.Date(


Calendar.getInstance().getTimeInMillis()));

// You can use the setFromString method to convert a string
// value into the proper data type to be stored in the column.
// You need to use a try...catch block in this case, since the
// string to value conversion could fail (for example, trying to
// store "Hello, World!" in a float column is not going to work).

try {
record.setFromString(4, "234.567");
} catch (ParseException e) {
// Thrown if the string cannot be parsed into the data type
// to be stored in the column.
e.printStackTrace();
}
// Column 5 stores a timestamp
record.set(5, new java.sql.Timestamp(
Calendar.getInstance().getTimeInMillis()));
// Column 6 stores a varchar
record.set(6, "example string");
// Column 7 stores a varbinary
record.set(7, new byte[10]);
// Once the columns are populated, write the record to store
// the row into the database.
context.write(new Text("mrtarget"), record);
}
}
@Override
public int run(String[] args) throws Exception {
// Set up the configuration and job objects
Configuration conf = getConf();
Job job = new Job(conf);
conf = job.getConfiguration();
conf.set("mapreduce.job.tracker", "local");
job.setJobName("vertica test");
// Set the input format to retrieve data from
// Vertica.
job.setInputFormatClass(VerticaInputFormat.class);
// Set the output format of the mapper. This is the interim
// data format passed to the reducer. Here, we will pass in a
// Double. The interim data is not processed by Vertica in any
// way.
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(DoubleWritable.class);
// Set the output format of the Hadoop application. It will
// output VerticaRecords that will be stored in the
// database.
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(VerticaRecord.class);
job.setOutputFormatClass(VerticaOutputFormat.class);
job.setJarByClass(VerticaExample.class);


job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);

// Sets the output format for storing data in Vertica. It defines the
// table where data is stored, and the columns it will be stored in.
VerticaOutputFormat.setOutput(job, "mrtarget", true, "a int",
"b boolean", "c char(1)", "d date", "f float", "t timestamp",
"v varchar", "z varbinary");
// Sets the query to use to get the data from the Vertica database.
// Query using a list of parameters.
VerticaInputFormat.setInput(job, "select * from allTypes where key = ?",
"1", "2", "3");
job.waitForCompletion(true);
return 0;
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new VerticaExample(),
args);
System.exit(res);
}
}

Compiling and Running the Example Application


To run the example Hadoop application, you first need to set up the allTypes table that the example
reads as input. To set up the input table, save the following Perl script as MakeAllTypes.pl to a
location on one of your HP Vertica nodes:
#!/usr/bin/perl
open FILE, ">datasource" or die $!;
for ($i=0; $i < 10; $i++) {
print FILE $i . "|" . rand(10000);
print FILE "|one|ONE|1|1999-01-08|1999-02-23 03:11:52.35";
print FILE '|1999-01-08 07:04:37|07:09:23|15:12:34 EST|0xabcd|';
print FILE '0xabcd|1234532|03:03:03' . qq(\n);
}
close FILE;

Then follow these steps:


1. Connect to the node where you saved the MakeAllTypes.pl file.
2. Run the MakeAllTypes.pl file. This will generate a file named datasource in the current
directory.

Note: If your node does not have Perl installed, you can run this script on a system that does have Perl installed, then copy the datasource file to a database node.

3. On the same node, use vsql to connect to your HP Vertica database.


4. Run the following query to create the allTypes table:
CREATE TABLE allTypes (key identity,intcol integer,
floatcol float,
charcol char(10),
varcharcol varchar,
boolcol boolean,
datecol date,
timestampcol timestamp,
timestampTzcol timestamptz,
timecol time,
timeTzcol timetz,
varbincol varbinary,
bincol binary,
numcol numeric(38,0),
intervalcol interval
);

5. Run the following query to load data from the datasource file into the table:
COPY allTypes COLUMN OPTION (varbincol FORMAT 'hex', bincol FORMAT 'hex')
FROM '/path-to-datasource/datasource' DIRECT;

Replace path-to-datasource with the absolute path to the datasource file located in the
same directory where you ran MakeAllTypes.pl.

Compiling the Example (optional)


The example code presented in this section is based on example code distributed along with the HP
Vertica Connector for Hadoop Map Reduce in the file hadoop-vertica-example.jar. If you just
want to run the example, skip to the next section and use the hadoop-vertica-example.jar file
that came as part of the connector package rather than a version you compiled yourself.
To compile the example code listed in Example HP Vertica Connector for Hadoop Map Reduce
Application, follow these steps:
1. Log into a node on your Hadoop cluster.
2. Locate the Hadoop home directory. See Installing the Connector for tips on how to find this
directory.

3. If it is not already set, set the environment variable HADOOP_HOME to the Hadoop home
directory:
export HADOOP_HOME=path_to_Hadoop_home

If you installed Hadoop using an .rpm or .deb package, Hadoop is usually installed in
/usr/lib/hadoop:
export HADOOP_HOME=/usr/lib/hadoop

4. Save the example source code to a file named VerticaExample.java on your Hadoop node.
5. In the same directory where you saved VerticaExample.java, create a directory named
classes. On Linux, the command is:
mkdir classes

6. Compile the Hadoop example:


javac -classpath \
$HADOOP_HOME/hadoop-core.jar:$HADOOP_HOME/lib/hadoop-vertica.jar \
-d classes VerticaExample.java \
&& jar -cvf hadoop-vertica-example.jar -C classes .

Note: If you receive errors about missing Hadoop classes, check the name of the hadoop-core.jar file. Most Hadoop installers (including Cloudera's) create a symbolic link named hadoop-core.jar to a version-specific .jar file (such as hadoop-core-0.20.203.0.jar). If your Hadoop installation did not create this link, you will have to supply the .jar file name with the version number.

When the compilation finishes, you will have a file named hadoop-vertica-example.jar in the
same directory as the VerticaExample.java file. This is the file you will have Hadoop run.

Running the Example Application


Once you have compiled the example, run it using the following command line:
hadoop jar hadoop-vertica-example.jar \
com.vertica.hadoop.VerticaExample \
-Dmapred.vertica.hostnames=VerticaHost01,VerticaHost02,... \
-Dmapred.vertica.port=portNumber \
-Dmapred.vertica.username=userName \
-Dmapred.vertica.password=dbPassword \
-Dmapred.vertica.database=databaseName

This command tells Hadoop to run your application's .jar file, and supplies the parameters needed
for your application to connect to your HP Vertica database. Fill in your own values for the
hostnames, port, user name, password, and database name for your HP Vertica database.
After entering the command line, you will see output from Hadoop as it processes data that looks
similar to the following:
12/01/11 10:41:19 INFO mapred.JobClient: Running job: job_201201101146_0005
12/01/11 10:41:20 INFO mapred.JobClient: map 0% reduce 0%
12/01/11 10:41:36 INFO mapred.JobClient: map 33% reduce 0%
12/01/11 10:41:39 INFO mapred.JobClient: map 66% reduce 0%
12/01/11 10:41:42 INFO mapred.JobClient: map 100% reduce 0%
12/01/11 10:41:45 INFO mapred.JobClient: map 100% reduce 22%
12/01/11 10:41:51 INFO mapred.JobClient: map 100% reduce 100%
12/01/11 10:41:56 INFO mapred.JobClient: Job complete: job_201201101146_0005
12/01/11 10:41:56 INFO mapred.JobClient: Counters: 23
12/01/11 10:41:56 INFO mapred.JobClient:   Job Counters
12/01/11 10:41:56 INFO mapred.JobClient:     Launched reduce tasks=1
12/01/11 10:41:56 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=21545
12/01/11 10:41:56 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/01/11 10:41:56 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/01/11 10:41:56 INFO mapred.JobClient:     Launched map tasks=3
12/01/11 10:41:56 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=13851
12/01/11 10:41:56 INFO mapred.JobClient:   File Output Format Counters
12/01/11 10:41:56 INFO mapred.JobClient:     Bytes Written=0
12/01/11 10:41:56 INFO mapred.JobClient:   FileSystemCounters
12/01/11 10:41:56 INFO mapred.JobClient:     FILE_BYTES_READ=69
12/01/11 10:41:56 INFO mapred.JobClient:     HDFS_BYTES_READ=318
12/01/11 10:41:56 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=89367
12/01/11 10:41:56 INFO mapred.JobClient:   File Input Format Counters
12/01/11 10:41:56 INFO mapred.JobClient:     Bytes Read=0
12/01/11 10:41:56 INFO mapred.JobClient:   Map-Reduce Framework
12/01/11 10:41:56 INFO mapred.JobClient:     Reduce input groups=1
12/01/11 10:41:56 INFO mapred.JobClient:     Map output materialized bytes=81
12/01/11 10:41:56 INFO mapred.JobClient:     Combine output records=0
12/01/11 10:41:56 INFO mapred.JobClient:     Map input records=3
12/01/11 10:41:56 INFO mapred.JobClient:     Reduce shuffle bytes=54
12/01/11 10:41:56 INFO mapred.JobClient:     Reduce output records=1
12/01/11 10:41:56 INFO mapred.JobClient:     Spilled Records=6
12/01/11 10:41:56 INFO mapred.JobClient:     Map output bytes=57
12/01/11 10:41:56 INFO mapred.JobClient:     Combine input records=0
12/01/11 10:41:56 INFO mapred.JobClient:     Map output records=3
12/01/11 10:41:56 INFO mapred.JobClient:     SPLIT_RAW_BYTES=318
12/01/11 10:41:56 INFO mapred.JobClient:     Reduce input records=3

Note: The version of the example supplied in the Hadoop Connector download package will produce more output, since it runs several input queries.

Verifying the Results


Once your Hadoop application finishes, you can verify it ran correctly by looking at the mrtarget table in your HP Vertica database. Connect to your HP Vertica database using vsql and run the following query:
=> SELECT * FROM mrtarget;

The results should look like this:


  a  | b | c |     d      |    f    |            t            |       v        |                    z
-----+---+---+------------+---------+-------------------------+----------------+------------------------------------------
 125 | t | c | 2012-01-11 | 234.567 | 2012-01-11 10:41:48.837 | example string | \000\000\000\000\000\000\000\000\000\000
(1 row)

Using Hadoop Streaming with the HP Vertica Connector for Hadoop Map Reduce
Hadoop streaming allows you to create an ad-hoc Hadoop job that uses standard commands (such
as UNIX command-line utilities) for its map and reduce processing. When using streaming, Hadoop
executes the command you pass to it as a mapper and breaks each line from its standard output into
key and value pairs. By default, the key and value are separated by the first tab character in the line.
These values are then passed to the standard input of the command that you specified as the
reducer. See the Hadoop wiki's topic on streaming for more information. You can have a streaming
job retrieve data from an HP Vertica database, store data into an HP Vertica database, or both.
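For example, if a mapper emits the hypothetical line shown in the sketch below, Hadoop treats everything before the first tab as the key and the rest as the value (here, key "1001" and value "1,11.11,ONE"); the data itself is made up for illustration:

# Hypothetical mapper output line: key "1001", value "1,11.11,ONE".
printf '1001\t1,11.11,ONE\n'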

Reading Data From HP Vertica in a Streaming Hadoop Job
To have a streaming Hadoop job read data from an HP Vertica database, you set the inputformat
argument of the Hadoop command line to com.vertica.hadoop.deprecated.VerticaStreamingInput.
You also need to supply parameters that tell the Hadoop job how to connect to your HP Vertica
database. See Passing Parameters to the HP Vertica Connector for Hadoop Map Reduce At Run
Time for an explanation of these command-line parameters.

Note: The VerticaStreamingInput class is within the deprecated namespace because the
current version of Hadoop (as of 0.20.1) has not defined a current API for streaming. Instead,
the streaming classes conform to the Hadoop version 0.18 API.
In addition to the standard command-line parameters that tell Hadoop how to access your
database, there are additional streaming-specific parameters you need to use that supply Hadoop
with the query it should use to extract data from HP Vertica and other query-related options.
mapred.vertica.input.query
    The query to use to retrieve data from the HP Vertica database. See Setting the Query
    to Retrieve Data from HP Vertica for more information.
    Required: Yes. Default: none.

mapred.vertica.input.paramquery
    A query to execute to retrieve parameters for the query given in the .input.query
    parameter.
    Required: If the query has a parameter and no discrete parameters are supplied.

mapred.vertica.query.params
    Discrete list of parameters for the query.
    Required: If the query has a parameter and no parameter query is supplied.

mapred.vertica.input.delimiter
    The character to use for separating column values. The command you use as a mapper
    needs to split individual column values apart using this delimiter.
    Required: No. Default: 0xa

mapred.vertica.input.terminator
    The character used to signal the end of a row of data from the query result.
    Required: No. Default: 0xb
The following command demonstrates reading data from a table named allTypes. This command
uses the UNIX cat command as a mapper and reducer so it will just pass the contents through.
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
-libjars $HADOOP_HOME/lib/hadoop-vertica.jar \
-Dmapred.vertica.hostnames=VerticaHost01,VerticaHost02,... \
-Dmapred.vertica.database=ExampleDB \
-Dmapred.vertica.username=ExampleUser \
-Dmapred.vertica.password=password123 \
-Dmapred.vertica.port=5433 \
-Dmapred.vertica.input.query="SELECT key, intcol, floatcol, varcharcol FROM allTypes
ORDER BY key" \
-Dmapred.vertica.input.delimiter=, \
-Dmapred.map.tasks=1 \
-inputformat com.vertica.hadoop.deprecated.VerticaStreamingInput \
-input /tmp/input -output /tmp/output -reducer /bin/cat -mapper /bin/cat

The results of this command are saved in the /tmp/output directory on your HDFS filesystem. On
a four-node Hadoop cluster, the results would be:
# $HADOOP_HOME/bin/hadoop dfs -ls /tmp/output
Found 5 items
drwxr-xr-x   - release supergroup          0 2012-01-19 11:47 /tmp/output/_logs
-rw-r--r--   3 release supergroup         88 2012-01-19 11:47 /tmp/output/part-00000
-rw-r--r--   3 release supergroup         58 2012-01-19 11:47 /tmp/output/part-00001
-rw-r--r--   3 release supergroup         58 2012-01-19 11:47 /tmp/output/part-00002
-rw-r--r--   3 release supergroup         87 2012-01-19 11:47 /tmp/output/part-00003
# $HADOOP_HOME/bin/hadoop dfs -tail /tmp/output/part-00000
1    2,1,3165.75558015273,ONE,
5    6,5,1765.76024139635,ONE,
9    10,9,4142.54176256463,ONE,
# $HADOOP_HOME/bin/hadoop dfs -tail /tmp/output/part-00001
2    3,2,8257.77313710329,ONE,
6    7,6,7267.69718012601,ONE,
# $HADOOP_HOME/bin/hadoop dfs -tail /tmp/output/part-00002
3    4,3,443.188765520475,ONE,
7    8,7,4729.27825566408,ONE,
# $HADOOP_HOME/bin/hadoop dfs -tail /tmp/output/part-00003
0    1,0,2456.83076632307,ONE,
4    5,4,9692.61214265391,ONE,
8    9,8,3327.25019418294,ONE,

Notes

l  Even though the input is coming from HP Vertica, you need to supply the -input parameter to Hadoop for it to process the streaming job.

l  The -Dmapred.map.tasks=1 parameter prevents multiple Hadoop nodes from reading the same data from the database, which would result in Hadoop processing multiple copies of the data.

Writing Data to HP Vertica in a Streaming Hadoop Job


Similar to reading from a streaming Hadoop job, you write data to HP Vertica by setting the
outputformat parameter of your Hadoop command to
com.vertica.hadoop.deprecated.VerticaStreamingOutput. This class requires key/value pairs, but
the keys are ignored. The values passed to VerticaStreamingOutput are broken into rows and
inserted into a target table. Since keys are ignored, you can use the keys to partition the data for the
reduce phase without affecting HP Vertica's data transfer.
As with reading from HP Vertica, you need to supply parameters that tell the streaming Hadoop job
how to connect to the database. See Passing Parameters to the HP Vertica Connector for Hadoop
Map Reduce At Run Time for an explanation of these command-line parameters. If you are reading
data from one HP Vertica database and writing to another, you need to use the output parameters, just as you would if you were reading from and writing to separate databases in a Hadoop application. There
are also additional parameters that configure the output of the streaming Hadoop job, listed in the
following table.
mapred.vertica.output.table.name
    The name of the table where Hadoop should store its data.
    Required: Yes. Default: none.

mapred.vertica.output.table.def
    The definition of the table. The format is the same as used for defining the output
    table for a Hadoop application. See Defining the Output Table for details.
    Required: If the table does not already exist in the database.

mapred.vertica.output.table.drop
    Whether to truncate the table before adding data to it.
    Required: No. Default: false

mapred.vertica.output.delimiter
    The character to use for separating column values.
    Required: No. Default: 0x7 (ASCII bell character)

mapred.vertica.output.terminator
    The character used to signal the end of a row of data.
    Required: No. Default: 0x8 (ASCII backspace)

The following example demonstrates reading two columns of data from an HP Vertica
database table named allTypes and writing it back to the same database in a table named
hadoopout. The command provides the definition for the table, so you do not have to manually
create the table beforehand.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
-libjars $HADOOP_HOME/lib/hadoop-vertica.jar \
-Dmapred.vertica.output.table.name=hadoopout \
-Dmapred.vertica.output.table.def="intcol integer, varcharcol varchar" \
-Dmapred.vertica.output.table.drop=true \
-Dmapred.vertica.hostnames=VerticaHost01,VerticaHost02,VerticaHost03 \
-Dmapred.vertica.database=ExampleDB \
-Dmapred.vertica.username=ExampleUser \
-Dmapred.vertica.password=password123 \
-Dmapred.vertica.port=5433 \
-Dmapred.vertica.input.query="SELECT intcol, varcharcol FROM allTypes ORDER BY key" \
-Dmapred.vertica.input.delimiter=, \
-Dmapred.vertica.output.delimiter=, \
-Dmapred.vertica.input.terminator=0x0a \
-Dmapred.vertica.output.terminator=0x0a \
-inputformat com.vertica.hadoop.deprecated.VerticaStreamingInput \
-outputformat com.vertica.hadoop.deprecated.VerticaStreamingOutput \
-input /tmp/input \
-output /tmp/output \
-reducer /bin/cat \
-mapper /bin/cat

After running this command, you can view the result by querying your database:
=> SELECT * FROM hadoopout;
 intcol | varcharcol
--------+------------
      1 | ONE
      5 | ONE
      9 | ONE
      2 | ONE
      6 | ONE
      0 | ONE
      4 | ONE
      8 | ONE
      3 | ONE
      7 | ONE
(10 rows)

Loading a Text File From HDFS into HP Vertica


One common task when working with Hadoop and HP Vertica is loading text files from the Hadoop
Distributed File System (HDFS) into an HP Vertica table. You can load these files using Hadoop
streaming, saving yourself the trouble of having to write custom map and reduce classes.
Note: Hadoop streaming is less efficient than a Java map/reduce Hadoop job, since it passes data through several different interfaces. Streaming is best used for smaller, one-time loads. If you need to load large amounts of data on a regular basis, you should create a standard Hadoop map/reduce job in Java or a script in Pig.
For example, suppose the text file in HDFS that you want to load contains values delimited by pipe characters (|), and each line of the file is terminated by a carriage return:
# $HADOOP_HOME/bin/hadoop dfs -cat /tmp/textdata.txt
1|1.0|ONE
2|2.0|TWO
3|3.0|THREE

In this case, the line delimiter poses a problem. You can easily include the column delimiter in the
Hadoop command line arguments. However, it is hard to specify a carriage return in the Hadoop
command line. To get around this issue, you can write a mapper script to strip the carriage return
and replace it with some other character that is easy to enter in the command line and also does not
occur in the data.
Below is an example of a mapper script written in Python. It performs two tasks:
l  Strips the carriage returns from the input text and terminates each line with a tilde (~).

l  Adds a key value (the string "streaming") followed by a tab character at the start of each line of the text file. The mapper script needs to do this because the streaming job that reads text files skips the reducer stage. The reducer isn't necessary, since all of the data being read from the text file should be stored in the HP Vertica tables. However, the VerticaStreamingOutput class requires key/value pairs, so the mapper script adds the key.
#!/usr/bin/python
import sys
for line in sys.stdin.readlines():
    # Get rid of carriage returns.
    # CR is used as the record terminator by Streaming.jar
    line = line.strip()
    # Add a key. The key value can be anything.
    # The convention is to use the name of the
    # target table, as shown here.
    sys.stdout.write("streaming\t%s~\n" % line)

The Hadoop command to stream text files from the HDFS into HP Vertica using the above mapper
script appears below.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
-libjars $HADOOP_HOME/lib/hadoop-vertica.jar \
-Dmapred.reduce.tasks=0 \
-Dmapred.vertica.output.table.name=streaming \
-Dmapred.vertica.output.table.def="intcol integer, floatcol float, varcharcol
varchar" \
-Dmapred.vertica.hostnames=VerticaHost01,VerticaHost02,VerticaHost03 \
-Dmapred.vertica.port=5433 \
-Dmapred.vertica.username=ExampleUser \
-Dmapred.vertica.password=password123 \
-Dmapred.vertica.database=ExampleDB \
-Dmapred.vertica.output.delimiter="|" \
-Dmapred.vertica.output.terminator="~" \
-input /tmp/textdata.txt \
-output output \
-mapper "python path-to-script/mapper.py" \
-outputformat com.vertica.hadoop.deprecated.VerticaStreamingOutput

Notes
l  The -Dmapred.reduce.tasks=0 parameter disables the streaming job's reducer stage. It does not need a reducer since the mapper script processes the data into the format that the VerticaStreamingOutput class expects.

l  Even though the VerticaStreamingOutput class is handling the output from the mapper, you need to supply a valid output directory to the Hadoop command.

The result of running the command is a new table in the HP Vertica database:
=> SELECT * FROM streaming;
 intcol | floatcol | varcharcol
--------+----------+------------
      3 |        3 | THREE
      1 |        1 | ONE
      2 |        2 | TWO
(3 rows)

Accessing HP Vertica From Pig


The HP Vertica Connector for Hadoop Map Reduce includes a Java package that lets you access
an HP Vertica database using Pig. You must copy this .jar to somewhere in your Pig installation's
CLASSPATH (see Installing the Connector for details).
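For example, if your Pig installation picks up .jar files from its lib directory (as the REGISTER statements in the next section assume), the copy might look like the following sketch, where the source directory is a placeholder for wherever you unpacked the connector package:

# Placeholder paths -- adjust to your connector package location and Pig home directory.
cp /path/to/connector-package/pig-vertica.jar /path-to-pig-home/lib/
cp /path/to/connector-package/vertica-jdbc-7.1.x.jar /path-to-pig-home/lib/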

Registering the HP Vertica .jar Files


Before it can access HP Vertica, your Pig Latin script must register the HP Vertica-related .jar
files. All of your Pig scripts should start with the following commands:

REGISTER 'path-to-pig-home/lib/vertica-jdbc-7.1.x.jar';
REGISTER 'path-to-pig-home/lib/pig-vertica.jar';

These commands ensure that Pig can locate the HP Vertica JDBC classes, as well as the
interface for the connector.

Reading Data From HP Vertica


To read data from an HP Vertica database, you tell Pig Latin's LOAD statement to use a SQL query
and to use the VerticaLoader class as the load function. Your query can be hard-coded or contain
a parameter. See Setting the Query to Retrieve Data from HP Vertica for details.
Note: You can only use a discrete parameter list or supply a query to retrieve parameter values; you cannot use a collection to supply parameter values as you can from within a Hadoop application.
The format for calling the VerticaLoader is:
com.vertica.pig.VerticaLoader('hosts','database','port','username','password');
hosts

A comma-separated list of the hosts in the HP Vertica cluster.

database

The name of the database to be queried.

port

The port number for the database.

username

The username to use when connecting to the database.

password

The password to use when connecting to the database. This is the only optional
parameter. If not present, the connector assumes the password is empty.

The following Pig Latin command extracts all of the data from the table named allTypes using a
simple query:
A = LOAD 'sql://{SELECT * FROM allTypes ORDER BY key}' USING
com.vertica.pig.VerticaLoader('Vertica01,Vertica02,Vertica03',
'ExampleDB','5433','ExampleUser','password123');

This example uses a parameter and supplies a discrete list of parameter values:
A = LOAD 'sql://{SELECT * FROM allTypes WHERE key = ?};{1,2,3}' USING
com.vertica.pig.VerticaLoader('Vertica01,Vertica02,Vertica03',
'ExampleDB','5433','ExampleUser','password123');

This final example demonstrates using a second query to retrieve parameters from the HP Vertica
database.
A = LOAD 'sql://{SELECT * FROM allTypes WHERE key = ?};sql://{SELECT DISTINCT key FROM
allTypes}'
USING com.vertica.pig.VerticaLoader('Vertica01,Vertica02,Vertica03','ExampleDB',
'5433','ExampleUser','password123');

Writing Data to HP Vertica


To write data to an HP Vertica database, you tell Pig Latin's STORE statement to save data to a
database table (optionally giving the definition of the table) and to use the VerticaStorer class as
the save function. If the table you specify as the destination does not exist, and you supplied the
table definition, the table is automatically created in your HP Vertica database and the data from the
relation is loaded into it.
The syntax for calling the VerticaStorer is the same as calling VerticaLoader:
com.vertica.pig.VerticaStorer('hosts','database','port','username','password');

The following example demonstrates saving a relation into a table named hadoopOut which must
already exist in the database:
STORE A INTO '{hadoopOut}' USING
com.vertica.pig.VerticaStorer('Vertica01,Vertica02,Vertica03','ExampleDB','5433',
'ExampleUser','password123');

This example shows how you can add a table definition to the table name, so that the table is
created in HP Vertica if it does not already exist:
STORE A INTO '{outTable(a int, b int, c float, d char(10), e varchar, f boolean, g date,
h timestamp, i timestamptz, j time, k timetz, l varbinary, m binary,
n numeric(38,0), o interval)}' USING
com.vertica.pig.VerticaStorer('Vertica01,Vertica02,Vertica03','ExampleDB','5433',
'ExampleUser','password123');

Note: If the table already exists in the database, and the definition that you supply differs from
the table's definition, the table is not dropped and recreated. This may cause data type errors
when data is being loaded.


Using the HP Vertica Connector for HDFS


The Hadoop Distributed File System (HDFS) is the location where Hadoop usually stores its input
and output files. It stores files across the Hadoop cluster redundantly, to keep the files available
even if some nodes are down. HDFS also makes Hadoop more efficient, by spreading file access
tasks across the cluster to help limit I/O bottlenecks.
The HP Vertica Connector for HDFS lets you load files from HDFS into HP Vertica using the
COPY statement. You can also create external tables that access data stored on HDFS as if it
were a native HP Vertica table. The connector is useful if your Hadoop job does not directly store its
data in HP Vertica using the HP Vertica Connector for Hadoop Map Reduce (see Using the HP
Vertica Connector for Hadoop MapReduce) or if you use HDFS to store files and want to process
them using HP Vertica.
Note: The files you load from HDFS using the HP Vertica Connector for HDFS usually have a
delimited format. Column values are separated by a character, such as a comma or a pipe
character (|). This format is the same type used in other files you load with the
COPY statement. Hadoop MapReduce jobs often output tab-delimited files.
Like the HP Vertica Connector for Hadoop Map Reduce, the HP Vertica Connector for HDFS takes
advantage of the distributed nature of both HP Vertica and Hadoop. Individual nodes in the HP
Vertica cluster connect directly to nodes in the Hadoop cluster when you load multiple files from the
HDFS.
Hadoop splits large files into file segments that it stores on different nodes. The connector directly
retrieves these file segments from the node storing them, rather than relying on the Hadoop cluster
to reassemble the file.
The connector is read-only; it cannot write data to HDFS.
The HP Vertica Connector for HDFS can connect to a Hadoop cluster through unauthenticated and
Kerberos-authenticated connections.

HP Vertica Connector for HDFS Requirements


The HP Vertica Connector for HDFS connects to the Hadoop file system using WebHDFS, a built-in component of HDFS that provides access to HDFS files to applications outside of Hadoop. WebHDFS was added to Hadoop in version 1.0, so your Hadoop installation must be version 1.0 or later.

In addition, the WebHDFS system must be enabled. See your Hadoop distribution's documentation
for instructions on configuring and enabling WebHDFS.
Note: HTTPfs (also known as HOOP) is another method of accessing files stored in an HDFS.
It relies on a separate server process that receives requests for files and retrieves them from
the HDFS. Since it uses a REST API that is compatible with WebHDFS, it could theoretically
work with the connector. However, the connector has not been tested with HTTPfs and HP
does not support using the HP Vertica Connector for HDFS with HTTPfs. In addition, since all
of the files retrieved from HDFS must pass through the HTTPfs server, it is less efficient than
WebHDFS, which lets HP Vertica nodes directly connect to the Hadoop nodes storing the file
blocks.

Kerberos Authentication Requirements


The HP Vertica Connector for HDFS can connect to the Hadoop file system using Kerberos
authentication. To use Kerberos, your connector must meet these additional requirements:
l  Your HP Vertica installation must be Kerberos-enabled.

l  Your Hadoop cluster must be configured to use Kerberos authentication.

l  Your connector must be able to connect to the Kerberos-enabled Hadoop cluster.

l  The Kerberos server must be running version 5.

l  The Kerberos server must be accessible from every node in your HP Vertica cluster.

l  You must have Kerberos principals (users) that map to Hadoop users. You use these principals to authenticate your HP Vertica users with the Hadoop cluster.

Before you can use the HP Vertica Connector for HDFS with Kerberos, you must install the Kerberos client and libraries on your HP Vertica cluster.

Testing Your Hadoop WebHDFS Configuration


To ensure that your Hadoop installation's WebHDFS system is configured and running, follow
these steps:
1. Log into your Hadoop cluster and locate a small text file on the Hadoop filesystem. If you do
not have a suitable file, you can create a file named test.txt in the /tmp directory using the following command:
echo -e "A|1|2|3\nB|4|5|6" | hadoop fs -put - /tmp/test.txt

2. Log into a host in your HP Vertica database using the database administrator account.
3. If you are using Kerberos authentication, authenticate with the Kerberos server using the
keytab file for a user who is authorized to access the file. For example, to authenticate as a
user named exampleuser@MYCOMPANY.COM, use the command:
$ kinit exampleuser@MYCOMPANY.COM -k -t /path/exampleuser.keytab

Where path is the path to the keytab file you copied over to the node. You do not receive any
message if you authenticate successfully. You can verify that you are authenticated by using
the klist command:
$ klist
Ticket cache: FILE:/tmp/krb5cc_500
Default principal: exampleuser@MYCOMPANY.COM

Valid starting     Expires            Service principal
07/24/13 14:30:19  07/25/13 14:30:19  krbtgt/MYCOMPANY.COM@MYCOMPANY.COM
        renew until 07/24/13 14:30:19

4. Test retrieving the file:


n  If you are not using Kerberos authentication, run the following command from the Linux
   command line:

curl -i -L "http://hadoopNameNode:50070/webhdfs/v1/tmp/test.txt?op=OPEN&user.name=hadoopUserName"

Replace hadoopNameNode with the hostname or IP address of the name node in your
Hadoop cluster, /tmp/test.txt with the path to the file in the Hadoop filesystem you
located in step 1, and hadoopUserName with the user name of a Hadoop user that has read
access to the file.
If successful, the command produces output similar to the following:
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Set-Cookie: hadoop.auth="u=hadoopUser&p=password&t=simple&e=1344383263490&s=n8YB/CHFg56qNmRQRTqO0IdRMvE="; Version=1; Path=/
Content-Type: application/octet-stream
Content-Length: 16
Date: Tue, 07 Aug 2012 13:47:44 GMT

A|1|2|3
B|4|5|6

n  If you are using Kerberos authentication, run the following command from the Linux
   command line:
curl --negotiate -i -L -u:anyUser
http://hadoopNameNode:50070/webhdfs/v1/tmp/test.txt?op=OPEN

Replace hadoopNameNode with the hostname or IP address of the name node in your
Hadoop cluster, and /tmp/test.txt with the path to the file in the Hadoop filesystem you
located in step 1.
If successful, the command produces output similar to the following:
HTTP/1.1 401 Unauthorized
Content-Type: text/html; charset=utf-8
WWW-Authenticate: Negotiate
Content-Length: 0
Server: Jetty(6.1.26)

HTTP/1.1 307 TEMPORARY_REDIRECT
Content-Type: application/octet-stream
Expires: Thu, 01-Jan-1970 00:00:00 GMT
Set-Cookie: hadoop.auth="u=exampleuser&p=exampleuser@MYCOMPANY.COM&t=kerberos&e=1375144834763&s=iY52iRvjuuoZ5iYG8G5g12O2Vwo=";Path=/
Location: http://hadoopnamenode.mycompany.com:1006/webhdfs/v1/user/release/docexample/test.txt?op=OPEN&delegation=JAAHcmVsZWFzZQdyZWxlYXNlAIoBQCrfpdGKAUBO7CnRju3TbBSlID_osB658jfGfRpEt8-u9WHymRJXRUJIREZTIGRlbGVnYXRpb24SMTAuMjAuMTAwLjkxOjUwMDcw&offset=0
Content-Length: 0
Server: Jetty(6.1.26)

HTTP/1.1 200 OK
Content-Type: application/octet-stream
Content-Length: 16
Server: Jetty(6.1.26)

A|1|2|3
B|4|5|6

If the curl command fails, you must review the error messages and resolve any issues before using
the HP Vertica Connector for HDFS with your Hadoop cluster. Some debugging steps include:

l  Verify the HDFS service's port number.

l  Verify that the Hadoop user you specified exists and has read access to the file you are attempting to retrieve.

Installing the HP Vertica Connector for HDFS


The HP Vertica Connector for HDFS is not included as part of the HP Vertica Server installation.
You must download it from my.vertica.com and install it on all nodes in your HP Vertica database.
The connector installation package contains several support libraries in addition to the library for the connector. Unlike some other packages supplied by HP, you need to install this package on all of the hosts in your HP Vertica database so each host has a copy of the support libraries.

Downloading and Installing the HP Vertica Connector for HDFS Package

Follow these steps to install the HP Vertica Connector for HDFS:
1. Use a Web browser to log into the myVertica portal.
2. Click the Download tab.
3. Locate the section for the HP Vertica Connector for HDFS that you want and download the
installation package that matches the Linux distribution on your HP Vertica cluster.
4. Copy the installation package to each host in your database.
5. Log into each host as root and run the following command to install the connector package.
n  On Red Hat-based Linux distributions, use the command:

rpm -Uvh /path/installation-package.rpm

   For example, if you downloaded the Red Hat installation package to the dbadmin home directory, the command to install the package is:

rpm -Uvh /home/dbadmin/vertica-hdfs-connectors-7.1.x.x86_64.RHEL5.rpm

n  On Debian-based systems, use the command:

dpkg -i /path/installation-package.deb

Once you have installed the connector package on each host, you need to load the connector library
into HP Vertica. See Loading the HDFS User Defined Source for instructions.
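If you manage many hosts, the copy and install steps above can be scripted. The following is a minimal sketch for Red Hat-based hosts, assuming passwordless SSH as a user with sudo rights; the host names and package file name are placeholders for your environment:

# Placeholder host names and package file name -- substitute your own.
for host in vertica01 vertica02 vertica03; do
    scp vertica-hdfs-connectors-7.1.x.x86_64.RHEL5.rpm "$host:/tmp/"
    ssh "$host" "sudo rpm -Uvh /tmp/vertica-hdfs-connectors-7.1.x.x86_64.RHEL5.rpm"
done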

Loading the HDFS User Defined Source


Once you have installed the HP Vertica Connector for HDFS package on each host in your
database, you need to load the connector's library into HP Vertica and define the User Defined
Source (UDS) in the HP Vertica catalog. The UDS is the component you use to access data from the
HDFS. The connector install package contains a SQL script named install.sql that performs
these steps for you. To run it:
1. Log into an HP Vertica host using the database administrator account.
2. Execute the installation script:
vsql -f /opt/vertica/packages/hdfs_connectors/install.sql

3. Enter the database administrator password if prompted.

Note: You only need to run an installation script once in order to create the User Defined
Source in the HP Vertica catalog. You do not need to run the install script on each node.
The SQL install script loads the HP Vertica Connector for HDFS library and defines the HDFS User
Defined Source named HDFS. The output of successfully running the installation script looks like
this:
                 version
-------------------------------------------
 Vertica Analytic Database v7.1.x
(1 row)
CREATE LIBRARY
CREATE SOURCE FUNCTION

Once the install script finishes running, the connector is ready to use.

Loading Data Using the HP Vertica Connector for HDFS
After you install the HP Vertica Connector for HDFS, you can use the HDFS User Defined Source
(UDS) in a COPY statement to load data from HDFS files.
The syntax for using the HDFS UDS in a COPY statement is:
COPY tableName SOURCE Hdfs(url='WebHDFSFileURL', [username='username'],
[low_speed_limit=speed]);

tableName

The name of the table to receive the copied data.

WebHDFSFileURL

A string containing one or more URLs that identify the file or files to be read. See
below for details. Use commas to separate multiple URLs.
Important: You must replace any commas in the URLs with the escape
sequence %2c. For example, if you are loading a file named doe,john.txt,
change the file's name in the URL to doe%2cjohn.txt.

username

The username of a Hadoop user that has permissions to access the files you
want to copy. If you are using Kerberos, omit this argument.

speed

The minimum data transmission rate, expressed in bytes per second, that the
connector allows. The connector breaks any connection between the Hadoop
and HP Vertica clusters that transmits data slower than this rate for more than 1
minute. After the connector breaks a connection for being too slow, it attempts
to connect to another node in the Hadoop cluster. This new connection can
supply the data that the broken connection was retrieving. The connector
terminates the COPY statement and returns an error message if:

l  It cannot find another Hadoop node to supply the data.

l  The previous transfer attempts from all other Hadoop nodes that have the file also closed because they were too slow.

Default value: 1048576 (1 MB per second transmission rate)

The HDFS File URL


The url parameter in the Hdfs function call is a string containing one or more comma-separated
HTTP URLs. These URLs identify the files in HDFS that you want to load. The format for each
URL in this string is:
http://NameNode:port/webhdfs/v1/HDFSFilePath

NameNode

The host name or IP address of the Hadoop cluster's name node.

Port

The port number on which the WebHDFS service is running. This number is
usually 50070 or 14000, but may be different in your Hadoop installation.

webhdfs/v1/

The protocol being used to retrieve the file. This portion of the URL is always the
same. It tells Hadoop to use version 1 of the WebHDFS API.

HDFSFilePath

The path from the root of the HDFS filesystem to the file or files you want to load.
This path can contain standard Linux wildcards.
Important: Any wildcards you use to specify multiple input files must resolve
to files only. They must not include any directories. For example, if you
specify the path /user/HadoopUser/output/*, and the output directory
contains a subdirectory, the connector returns an error message.

The following example shows how to use the HP Vertica Connector for HDFS to load a single file
named /tmp/test.txt. The Hadoop cluster's name node is named hadoop.
=> COPY testTable SOURCE Hdfs(url='http://hadoop:50070/webhdfs/v1/tmp/test.txt',
username='hadoopUser');
 Rows Loaded
-------------
           2
(1 row)

Copying Files in Parallel


The basic COPY statement in the previous example copies a single file. It runs on just a single host
in the HP Vertica cluster because the Connector cannot break up the workload among nodes. Any
data load that does not take advantage of all nodes in the HP Vertica cluster is inefficient.
To make loading data from HDFS more efficient, spread the data across multiple files on HDFS.
This approach is often natural for data you want to load from HDFS. Hadoop MapReduce jobs
usually store their output in multiple files.

You specify multiple files to be loaded in your Hdfs function call by:
l  Using wildcards in the URL

l  Supplying multiple comma-separated URLs in the url parameter of the Hdfs user-defined source function call

l  Supplying multiple comma-separated URLs that contain wildcards

Loading multiple files through the HP Vertica Connector for HDFS results in an efficient load. The
HP Vertica hosts connect directly to individual nodes in the Hadoop cluster to retrieve files. If
Hadoop has broken files into multiple chunks, the HP Vertica hosts directly connect to the nodes
storing each chunk.
The following example shows how to load all of the files whose filenames start with "part-" located
in the /user/hadoopUser/output directory on the HDFS. If there are at least as many files in this
directory as there are nodes in the HP Vertica cluster, all nodes in the cluster load data from the
HDFS.
=> COPY Customers SOURCE Hdfs
-> (url='http://hadoop:50070/webhdfs/v1/user/hadoopUser/output/part-*',
->  username='hadoopUser');
 Rows Loaded
-------------
       40008
(1 row)

To load data from multiple directories on HDFS at once use multiple comma-separated URLs in the
URL string:
=> COPY Customers SOURCE Hdfs
-> (url='http://hadoop:50070/webhdfs/v1/user/HadoopUser/output/part-*,
->  http://hadoop:50070/webhdfs/v1/user/AnotherUser/part-*',
->  username='hadoopUser');
 Rows Loaded
-------------
       80016
(1 row)

Note: HP Vertica statements must be less than 65,000 characters long. If you supply too
many long URLs in a single statement, you could go over this limit. Normally, you would only
approach this limit if you are automatically generating the COPY statement using a program
or script.
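A minimal sketch of such a script appears below. It assumes a file named urls.txt that lists one WebHDFS URL per line, and it loads into a table named Customers; the file name, table name, and Hadoop user are placeholders:

# Build a comma-separated url parameter from urls.txt and run the COPY statement through vsql.
urls=$(paste -sd, urls.txt)
echo "COPY Customers SOURCE Hdfs(url='${urls}', username='hadoopUser');" > load_from_hdfs.sql
vsql -f load_from_hdfs.sql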

Viewing Rejected Rows and Exceptions


COPY statements that use the HP Vertica Connector for HDFS use the same method for recording
rejections and exceptions as other COPY statements. Rejected rows and exceptions are saved to
log files. These log files are stored by default in the CopyErrorLogs subdirectory in the database's
catalog directory. Due to the distributed nature of the HP Vertica Connector for HDFS, you cannot
use the ON option to force all exception and rejected row information to be written to log files on a
single HP Vertica host. Instead, you need to collect the log files from across the hosts to review all
of the exceptions and rejections generated by the COPY statement.
For more about handling rejected rows, see Capturing Load Rejections and Exceptions.
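One way to gather those log files for review is to copy them from each node to a single location. The following sketch assumes three nodes and the default catalog location under the database administrator's home directory; the host names and paths are placeholders for your installation:

# Placeholder host names and catalog path -- substitute your own.
for host in vertica01 vertica02 vertica03; do
    mkdir -p ./copy_errors/$host
    scp "$host:/home/dbadmin/*/v_*_catalog/CopyErrorLogs/*" ./copy_errors/$host/
done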

Creating an External Table Based on HDFS Files


You can use the HP Vertica Connector for HDFS as a source for an external table that lets you
directly perform queries on the contents of files on the Hadoop Distributed File System (HDFS).
See Using External Tables in the Administrator's Guide for more information on external tables.
Using an external table to access data stored on an HDFS is useful when you need to extract data
from files that are periodically updated, or have additional files added on the HDFS. It saves you
from having to drop previously loaded data and then reload the data using a COPY statement. The
external table always accesses the current version of the files on the HDFS.
Note: An external table performs a bulk load each time it is queried. Its performance is
significantly slower than querying an internal HP Vertica table. You should only use external
tables for infrequently-run queries (such as daily reports). If you need to frequently query the
content of the HDFS files, you should either use COPY to load the entire content of the files
into HP Vertica or save the results of a query run on an external table to an internal table which
you then use for repeated queries.
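For example, one way to follow that advice is to snapshot the external table's contents into a regular HP Vertica table and run repeated queries against the copy. The sketch below assumes an external table named hadoopExample, like the one created later in this section, and a placeholder internal table named hadoopExampleLocal:

# Snapshot the external table into an internal table and query the copy.
vsql -c "DROP TABLE IF EXISTS hadoopExampleLocal;"
vsql -c "CREATE TABLE hadoopExampleLocal AS SELECT * FROM hadoopExample;"
vsql -c "SELECT COUNT(*) FROM hadoopExampleLocal;"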
To create an external table that reads data from HDFS, you use the Hdfs User Defined Source
(UDS) in a CREATE EXTERNAL TABLE AS COPY statement. The COPY portion of this
statement has the same format as the COPY statement used to load data from HDFS. See Loading
Data Using the HP Vertica Connector for HDFS for more information.
The following simple example shows how to create an external table that extracts data from every
file in the /user/hadoopUser/example/output directory using the HP Vertica Connector for
HDFS.
=> CREATE EXTERNAL TABLE hadoopExample (A VARCHAR(10), B INTEGER, C INTEGER, D INTEGER)
->    AS COPY SOURCE Hdfs(url=
->    'http://hadoop01:50070/webhdfs/v1/user/hadoopUser/example/output/*',
->    username='hadoopUser');
CREATE TABLE
=> SELECT * FROM hadoopExample;
   A   | B | C | D
-------+---+---+---
 test1 | 1 | 2 | 3
 test1 | 3 | 4 | 5
(2 rows)

Later, after another Hadoop job adds contents to the output directory, querying the table produces
different results:
=> SELECT * FROM hadoopExample;
   A   | B  | C  | D
-------+----+----+----
 test3 | 10 | 11 | 12
 test3 | 13 | 14 | 15
 test2 |  6 |  7 |  8
 test2 |  9 |  0 | 10
 test1 |  1 |  2 |  3
 test1 |  3 |  4 |  5
(6 rows)

Load Errors in External Tables


Normally, querying an external table on HDFS does not produce any errors, even if rows are rejected by the underlying COPY statement (for example, rows containing columns whose contents are incompatible with the data types in the table). Rejected rows are handled the same way they are in
a standard COPY statement: they are written to a rejected data file, and are noted in the exceptions
file. For more information on how COPY handles rejected rows and exceptions, see Capturing Load
Rejections and Exceptions in the Administrator's Guide.
Rejections and exception files are created on all of the nodes that load data from the HDFS. You
cannot specify a single node to receive all of the rejected row and exception information. These files
are created on each HP Vertica node as they process files loaded through the HP Vertica
Connector for HDFS.
Note: Since the connector is read-only, there is no way to store rejection and exception
information on the HDFS.
Fatal errors during the transfer of data (for example, specifying files that do not exist on the HDFS)
do not occur until you query the external table. The following example shows what happens if you
recreate the table based on a file that does not exist on HDFS.

=> DROP TABLE hadoopExample;
DROP TABLE
=> CREATE EXTERNAL TABLE hadoopExample (A INTEGER, B INTEGER, C INTEGER, D INTEGER)
-> AS COPY SOURCE HDFS(url='http://hadoop01:50070/webhdfs/v1/tmp/nofile.txt',
-> username='hadoopUser');
CREATE TABLE
=> SELECT * FROM hadoopExample;
ERROR 0: Error calling plan() in User Function HdfsFactory at
[src/Hdfs.cpp:222], error code: 0, message: No files match
[http://hadoop01:50070/webhdfs/v1/tmp/nofile.txt]

Note that it is not until you actually query the table that the connector attempts to read the file.
Only then does it return an error.

Installing and Configuring Kerberos on Your HP Vertica Cluster
You must perform several steps to configure your HP Vertica cluster before the HP Vertica
Connector for HDFS can authenticate an HP Vertica user using Kerberos.
Note: You only need to perform these configuration steps if you are using the connector with
Kerberos. In a non-Kerberos environment, the connector does not require your HP Vertica
cluster to have any special configuration.
Perform the following steps on each node in your HP Vertica cluster:
1. Install the Kerberos libraries and client. To learn how to install these packages, see the
documentation for your Linux distribution. On some Red Hat and CentOS versions, you can
install these packages by executing the following commands as root:
yum install krb5-libs krb5-workstation

On some versions of Debian, you would use the command:


apt-get install krb5-config krb5-user krb5-clients

2. Update the Kerberos configuration file (/etc/krb5.conf) to reflect your site's Kerberos
configuration. The easiest method of doing this is to copy the /etc/krb5.conf file from your
Kerberos Key Distribution Center (KDC) server.

3. Copy the keytab files for the users to a directory on the node (see Preparing Keytab Files for
the HP Vertica Connector for HDFS). The absolute path to these files must be the same on
every node in your HP Vertica cluster.
4. Ensure the keytab files are readable by the database administrator's Linux account (usually
dbadmin). The easiest way to do this is to change ownership of the files to dbadmin:
sudo chown dbadmin *.keytab
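Taken together, the per-node preparation might look like the following sketch for a Red Hat-style host. The KDC host name, keytab source, and destination directory are placeholders for your site:

# Placeholder KDC host, keytab source, and destination directory.
sudo yum install krb5-libs krb5-workstation
sudo scp root@kdc.example.com:/etc/krb5.conf /etc/krb5.conf
sudo mkdir -p /home/dbadmin/keytabs
sudo cp /path/to/keytabs/*.keytab /home/dbadmin/keytabs/
sudo chown dbadmin /home/dbadmin/keytabs/*.keytab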


Preparing Keytab Files for the HP Vertica Connector for HDFS
The HP Vertica Connector for HDFS uses keytab files to authenticate HP Vertica users with
Kerberos, so they can access files on the Hadoop HDFS filesystem. These files take the place of
entering a password at a Kerberos login prompt.
You must have a keytab file for each HP Vertica user that needs to use the connector. The keytab
file must match the Kerberos credentials of a Hadoop user that has access to the HDFS.
Caution: Exporting a keytab file for a user changes the version number associated with the
user's Kerberos account. This change invalidates any previously exported keytab file for
the user. If a keytab file has already been created for a user and is currently in use, you should
use that keytab file rather than exporting a new keytab file. Otherwise, exporting a new keytab
file will cause any processes using an existing keytab file to no longer be able to authenticate.
To export a keytab file for a user:
1. Start the kadmin utility:
n  If you have access to the root account on the Kerberos Key Distribution Center (KDC) system, log into it and use the kadmin.local command. (If this command is not in the system search path, try /usr/kerberos/sbin/kadmin.local.)

n  If you do not have access to the root account on the Kerberos KDC, then you can use the kadmin command from a Kerberos client system as long as you have the password for the Kerberos administrator account. When you start kadmin, it will prompt you for the Kerberos administrator's password. You may need to have root privileges on the client system in order to run kadmin.

2. Use kadmin's xst (export) command to export the user's credentials to a keytab file:
xst -k username.keytab username@YOURDOMAIN.COM

HP Vertica Analytic Database (7.1.x)

Page 62 of 135

Hadoop Integration Guide


Using the HP Vertica Connector for HDFS

where username is the name of the Kerberos principal you want to export, and YOURDOMAIN.COM is
your site's Kerberos realm. This command creates a keytab file named username.keytab in
the current directory.
The following example demonstrates exporting a keytab file for a user named
exampleuser@MYCOMPANY.COM using the kadmin command on a Kerberos client system:
$ sudo kadmin
[sudo] password for dbadmin:
Authenticating as principal root/admin@MYCOMPANY.COM with password.
Password for root/admin@MYCOMPANY.COM:
kadmin: xst -k exampleuser.keytab exampleuser@MYCOMPANY.COM
Entry for principal exampleuser@MYCOMPANY.COM with kvno 2, encryption
type des3-cbc-sha1 added to keytab WRFILE:exampleuser.keytab.
Entry for principal exampleuser@MYCOMPANY.COM with kvno 2, encryption
type des-cbc-crc added to keytab WRFILE:exampleuser.keytab.

After exporting the keytab file, you can use the klist command to list the keys stored in the file:
$ sudo klist -k exampleuser.keytab
[sudo] password for dbadmin:
Keytab name: FILE:exampleuser.keytab
KVNO Principal
---- --------------------------------------------------------------------------
   2 exampleuser@MYCOMPANY.COM
   2 exampleuser@MYCOMPANY.COM

HP Vertica Connector for HDFS Troubleshooting Tips
The following sections explain some of the common issues you may encounter when using the HP
Vertica Connector for HDFS.

User Unable to Connect to Kerberos-Authenticated Hadoop Cluster
A user may suddenly be unable to connect to Hadoop through the connector in a Kerberos-enabled
environment. This issue can be caused by someone exporting a new keytab file for the user, which
invalidates existing keytab files. You can determine if an invalid keytab file is the problem by
comparing the key version number associated with the user's principal key in Kerberos with the key
version number stored in the keytab file on the HP Vertica cluster.
To find the key version number for a user in Kerberos:

1. From the Linux command line, start the kadmin utility (kadmin.local if you are logged into the
Kerberos Key Distribution Center). Run the getprinc command for the user:
$ sudo kadmin
[sudo] password for dbadmin:
Authenticating as principal root/admin@MYCOMPANY.COM with password.
Password for root/admin@MYCOMPANY.COM:
kadmin: getprinc exampleuser@MYCOMPANY.COM
Principal: exampleuser@MYCOMPANY.COM
Expiration date: [never]
Last password change: Fri Jul 26 09:40:44 EDT 2013
Password expiration date: [none]
Maximum ticket life: 1 day 00:00:00
Maximum renewable life: 0 days 00:00:00
Last modified: Fri Jul 26 09:40:44 EDT 2013 (root/admin@MYCOMPANY.COM)
Last successful authentication: [never]
Last failed authentication: [never]
Failed password attempts: 0
Number of keys: 2
Key: vno 3, des3-cbc-sha1, no salt
Key: vno 3, des-cbc-crc, no salt
MKey: vno 0
Attributes:
Policy: [none]

In the preceding example, there are two keys stored for the user, both of which are at version
number (vno) 3.
2. To get the version numbers of the keys stored in the keytab file, use the klist command:
$ sudo klist -ek exampleuser.keytab
Keytab name: FILE:exampleuser.keytab
KVNO Principal
---- ----------------------------------------------------------------------
   2 exampleuser@MYCOMPANY.COM (des3-cbc-sha1)
   2 exampleuser@MYCOMPANY.COM (des-cbc-crc)
   3 exampleuser@MYCOMPANY.COM (des3-cbc-sha1)
   3 exampleuser@MYCOMPANY.COM (des-cbc-crc)

The first column in the output lists the key version number. In the preceding example, the
keytab includes both key versions 2 and 3, so the keytab file can be used to authenticate the
user with Kerberos.

Resolving Error 5118


When using the connector, you may receive an error message similar to the following:
ERROR 5118: UDL specified no execution nodes; at least one execution node must be
specified

To correct this error, verify that all of the nodes in your HP Vertica cluster have the correct version
of the HP Vertica Connector for HDFS package installed. This error can occur if one or more of the
nodes do not have the supporting libraries installed. These libraries may be missing because one of
the nodes was skipped when initially installing the connector package. Another possibility is that
one or more nodes have been added since the connector was installed.

Transfer Rate Errors


The HP Vertica Connector for HDFS monitors how quickly Hadoop sends data to HP Vertica. In
some cases, the data transfer speed on a connection between a node in your Hadoop cluster and
a node in your HP Vertica cluster slows below a lower limit (by default, 1 MB per second). When
the transfer slows below this limit, the connector breaks the data transfer. It then connects to
another node in the Hadoop cluster that contains the data it was retrieving. If it cannot find another
node in the Hadoop cluster to supply the data (or has already tried all of the nodes in the Hadoop
cluster), the connector terminates the COPY statement and returns an error.
=> COPY messages SOURCE Hdfs
(url='http://hadoop.example.com:50070/webhdfs/v1/tmp/data.txt',
username='exampleuser');
ERROR 3399: Failure in UDx RPC call InvokeProcessUDL(): Error calling processUDL()
in User Defined Object [Hdfs] at [src/Hdfs.cpp:275], error code: 0,
message: [Transferring rate during last 60 seconds is 172655 byte/s, below threshold
1048576 byte/s, give up.
The last error message: Operation too slow. Less than 1048576 bytes/sec transferred the
last 1 seconds.
The URL:
http://hadoop.example.com:50070/webhdfs/v1/tmp/data.txt?op=OPEN&offset=154901544&length=113533912.
The redirected URL: http://hadoop.example.com:50075/webhdfs/v1/tmp/data.txt?op=OPEN&
namenoderpcaddress=hadoop.example.com:8020&length=113533912&offset=154901544.]

If you encounter this error, troubleshoot the connection between your HP Vertica and Hadoop
clusters. If there are no problems with the network, determine if either your Hadoop cluster or HP
Vertica cluster is overloaded. If the nodes in either cluster are too busy, they may not be able to
maintain the minimum data transfer rate.
If you cannot resolve the issue causing the slow transfer rate, you can lower the minimum
acceptable speed. To do so, set the low_speed_limit parameter for the Hdfs source. The following
example shows how to set low_speed_limit to 524288 to accept transfer rates as low as 512 KB per
second (half the default lower limit).
=> COPY messages SOURCE Hdfs
(url='http://hadoop.example.com:50070/webhdfs/v1/tmp/data.txt',
username='exampleuser', low_speed_limit=524288);
 Rows Loaded
-------------
     9891287
(1 row)

When you lower the low_speed_limit parameter, the COPY statement loading data from
HDFS may take a long time to complete.
You can also increase the low_speed_limit setting if the network between your Hadoop cluster and
HP Vertica cluster is fast. Raising the lower limit forces COPY statements to generate an error
when they run more slowly than they should, given the speed of the network.
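For example, on a fast network you might raise the limit so that an unusually slow transfer fails
quickly instead of dragging on. The following sketch reuses the table, URL, and username from the
earlier example (adjust them to your environment) and raises low_speed_limit to 4 MB per second:
=> COPY messages SOURCE Hdfs
(url='http://hadoop.example.com:50070/webhdfs/v1/tmp/data.txt',
username='exampleuser', low_speed_limit=4194304);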

Using the HCatalog Connector


The HP Vertica HCatalog Connector lets you access data stored in Apache's Hive data warehouse
software the same way you access it within a native HP Vertica table.
If your files are in the Optimized Row Columnar (ORC) format, you might be able to read them
directly instead of going through this connector. For more information, see Reading ORC Files
Directly.

Hive, HCatalog, and WebHCat Overview


There are several Hadoop components that you need to understand in order to use the HCatalog
connector:
• Apache's Hive lets you query data stored in a Hadoop Distributed File System (HDFS) the same
  way you query data stored in a relational database. Behind the scenes, Hive uses a set of
  serializer and deserializer (SerDe) classes to extract data from files stored on the HDFS and
  break it into columns and rows. Each SerDe handles data files in a specific format. For example,
  one SerDe extracts data from comma-separated data files while another interprets data stored in
  JSON format.

• Apache HCatalog is a component of the Hadoop ecosystem that makes Hive's metadata
  available to other Hadoop components (such as Pig).

• WebHCat (formerly known as Templeton) makes HCatalog and Hive data available via a
  REST web API. Through it, you can make an HTTP request to retrieve data stored in Hive, as
  well as information about the Hive schema.

HP Vertica's HCatalog Connector lets you transparently access data that is available through
WebHCat. You use the connector to define a schema in HP Vertica that corresponds to a Hive
database or schema. When you query data within this schema, the HCatalog Connector
transparently extracts and formats the data from Hadoop into tabular data. The data within this
HCatalog schema appears as if it is native to HP Vertica. You can even perform operations such as
joins between HP Vertica-native tables and HCatalog tables. For more details, see How the
HCatalog Connector Works.

HCatalog Connection Features


The HCatalog Connector lets you query data stored in Hive using the HP Vertica native
SQL syntax. Some of its main features are:

• The HCatalog Connector always reflects the current state of data stored in Hive.

• The HCatalog Connector uses the parallel nature of both HP Vertica and Hadoop to process
  Hive data. The result is that querying data through the HCatalog Connector is often faster than
  querying the data directly through Hive.

• Since HP Vertica performs the extraction and parsing of data, the HCatalog Connector does not
  significantly increase the load on your Hadoop cluster.

• The data you query through the HCatalog Connector can be used as if it were native HP Vertica
  data. For example, you can execute a query that joins data from a table in an HCatalog schema
  with a native table.

HCatalog Connection Considerations


There are a few things to keep in mind when using the HCatalog Connector:
• Hive's data is stored in flat files in a distributed filesystem, requiring it to be read and
  deserialized each time it is queried. This deserialization causes Hive's performance to be much
  slower than HP Vertica's. The HCatalog Connector has to perform the same process as Hive to
  read the data. Therefore, querying data stored in Hive using the HCatalog Connector is much
  slower than querying a native HP Vertica table. If you need to perform extensive analysis on
  data stored in Hive, you should consider loading it into HP Vertica through the HCatalog
  Connector or the WebHDFS connector. HP Vertica optimization often makes querying data
  through the HCatalog Connector faster than directly querying it through Hive.

• Hive supports complex data types such as lists, maps, and structs that HP Vertica does not
  support. Columns containing these data types are converted to a JSON representation of the
  data type and stored as a VARCHAR. See Data Type Conversions from Hive to HP Vertica.

Note: The HCatalog Connector is read only. It cannot insert data into Hive.

Reading ORC Files Directly


If your HDFS data is in the Optimized Row Columnar (ORC) format and uses no complex data
types, then instead of using the HCatalog Connector you can use the ORC Reader to access the
data directly. Reading directly may provide better performance.
The decisions you make when writing ORC files can affect performance when using them. To get
the best performance from the ORC Reader, do the following when writing:

• Use the latest available Hive version to write ORC files. (You can still read them with earlier
  versions.)

• Use a large stripe size; 256 MB or greater is preferred.

• Partition the data at the table level.

• Sort the columns based on frequency of access, most-frequent first.

• Use Snappy or ZLib compression.

Syntax
In the COPY statement, specify a format of ORC as follows:
COPY tableName FROM path ORC;

In the CREATE EXTERNAL TABLE AS COPY statement, specify a format of ORC as follows:
CREATE EXTERNAL TABLE tableName (columns)
AS COPY FROM path ORC;

If the file resides on the local file system of the node where you are issuing the command, use a
local file path for path. If the file resides elsewhere in HDFS, use the webhdfs:// prefix and then
specify the host name, port, and file path. Use ON ANY NODE for files that are not local to improve
performance.
COPY t FROM 'webhdfs://somehost:port/opt/data/orcfile' ON ANY NODE ORC;

The ORC reader supports ZLib and Snappy compression.


The CREATE EXTERNAL TABLE AS COPY statement must consume all of the columns in the
ORC file; unlike with some other data sources, you cannot select only the columns of interest. If
you omit columns, the ORC reader aborts with an error and does not copy any data.
If you load from multiple ORC files in the same COPY statement and any of them is aborted, the
entire load is aborted. This behavior differs from that for delimited files, where the COPY statement
loads what it can and ignores the rest.

Supported Data Types


The HP Vertica ORC file reader can natively read columns of all data types supported in HIVE
version 0.11 and later except for complex types. If complex types such as maps are encountered,
the COPY or CREATE EXTERNAL TABLE AS COPY statement aborts with an error message.
The ORC reader does not attempt to read only some columns; either the entire file is read or the
operation fails. For a complete list of supported types, see HIVE Data Types.

Kerberos
If the ORC file is located on an HDFS cluster that uses Kerberos authentication, HP Vertica uses
the current user's principal to authenticate. It does not use the database's principal.

Query Performance
When working with external tables in ORC format, HP Vertica tries to improve performance in two
ways: by pushing query execution closer to the data so less has to be read and transmitted, and by
taking advantage of data locality in planning the query.
Predicate pushdown moves parts of the query execution closer to the data, reducing the amount of
data that must be read from disk or across the network. ORC files have three levels of indexing: file
statistics, stripe statistics, and row group indexes. Predicates are applied only to the first two levels.
Predicate pushdown works and is automatically applied for ORC files written with HIVE version
0.14 and later. ORC files written with earlier versions of HIVE might not contain the required
statistics. When executing a query against an ORC file that lacks these statistics, HP Vertica logs
an EXTERNAL_PREDICATE_PUSHDOWN_NOT_SUPPORTED event in the QUERY_EVENTS
system table. If you are seeing performance problems with your queries, check this table for these
events.
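For example, a quick way to check for skipped pushdown is to look for that event type in
QUERY_EVENTS. This is a minimal sketch; the exact columns you choose to select may vary by
version:
=> SELECT event_type, event_description
-> FROM query_events
-> WHERE event_type = 'EXTERNAL_PREDICATE_PUSHDOWN_NOT_SUPPORTED';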
In a cluster where HP Vertica nodes are co-located on HDFS nodes, the query can also take
advantage of data locality. If data is on an HDFS node where a database node is also present, and
if the query is not restricted to specific nodes using ON NODE, then the query planner uses that
database node to read that data. This allows HP Vertica to read data locally instead of making a
network call.
You can see how much ORC data is being read locally by inspecting the query plan. The label for
LoadStep(s) in the plan contains a statement of the form: "X% of ORC data matched with
co-located Vertica nodes". To increase the volume of local reads, consider adding more database
nodes. HDFS data, by its nature, can't be moved to specific nodes, but if you run more database
nodes you increase the likelihood that a database node is local to one of the copies of the data.
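One hedged way to check locality for a particular query is to examine its plan with EXPLAIN and
look for the LoadStep label quoted above. The table name t below refers to the external table from
the following examples and is only illustrative:
=> EXPLAIN SELECT COUNT(*) FROM t;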

Examples
The following example shows how to read from all ORC files in a directory. It uses all supported
data types.

CREATE EXTERNAL TABLE t (a1 TINYINT, a2 SMALLINT, a3 INT, a4 BIGINT, a5 FLOAT,
    a6 DOUBLE PRECISION, a7 BOOLEAN, a8 DATE, a9 TIMESTAMP,
    a10 VARCHAR(20), a11 VARCHAR(20), a12 CHAR(20), a13 BINARY(20),
    a14 DECIMAL(10,5))
AS COPY FROM '/data/orc_test_*.orc' ORC;

The following example shows the error that is produced if the file you specify is not recognized as
an ORC file:
CREATE EXTERNAL TABLE t (a1 TINYINT, a2 SMALLINT, a3 INT, a4 BIGINT, a5 FLOAT)
AS COPY FROM '/data/not_an_orc_file.orc' ORC;
ERROR 0: Failed to read orc source [/data/not_an_orc_file.orc]: Not an ORC file

How the HCatalog Connector Works


When planning a query that accesses data from a Hive table, the HP Vertica HCatalog Connector
on the initiator node contacts the WebHCat server in your Hadoop cluster to determine if the table
exists. If it does, the connector retrieves the table's metadata from the metastore database so the
query planning can continue. When the query executes, all nodes in the HP Vertica cluster directly
retrieve the data necessary for completing the query from the Hadoop HDFS. They then use the
Hive SerDe classes to extract the data so the query can execute.

This approach takes advantage of the parallel nature of both HP Vertica and Hadoop. In addition, by
performing the retrieval and extraction of data directly, the HCatalog Connector reduces the impact
of the query on the Hadoop cluster.

HCatalog Connector Requirements


Before you can use the HCatalog Connector, both your HP Vertica and Hadoop installations must
meet the following requirements.

HP Vertica Requirements
All of the nodes in your cluster must have a Java Virtual Machine (JVM) installed. See Installing the
Java Runtime on Your HP Vertica Cluster.
You must also add certain libraries distributed with Hadoop and Hive to your HP Vertica installation
directory. See Configuring HP Vertica for HCatalog.

Hadoop Requirements
Your Hadoop cluster must meet several requirements to operate correctly with the HP Vertica
Connector for HCatalog:
• It must have Hive and HCatalog installed and running. See Apache's HCatalog page for more
  information.

• It must have WebHCat (formerly known as Templeton) installed and running. See Apache's
  WebHCat page for details.

• The WebHCat server and all of the HDFS nodes that store HCatalog data must be directly
  accessible from all of the hosts in your HP Vertica database. Verify that any firewall separating
  the Hadoop cluster and the HP Vertica cluster will pass WebHCat, metastore database, and
  HDFS traffic.

• The data that you want to query must be in an internal or external Hive table.

• If a table you want to query uses a non-standard SerDe, you must install the SerDe's classes on
  your HP Vertica cluster before you can query the data. See Using Non-Standard SerDes.

Testing Connectivity
To test the connection between your database cluster and WebHCat, log into a node in your HP
Vertica cluster. Then, run the following command to execute an HCatalog query:

$ curl http://webHCatServer:port/templeton/v1/status?user.name=hcatUsername

Where:
• webHCatServer is the IP address or hostname of the WebHCat server

• port is the port number assigned to the WebHCat service (usually 50111)

• hcatUsername is a valid username authorized to use HCatalog

Usually, you want to append ;echo to the command to add a linefeed after the curl command's
output. Otherwise, the command prompt is automatically appended to the command's output,
making it harder to read.
For example:
$ curl http://hcathost:50111/templeton/v1/status?user.name=hive; echo

If there are no errors, this command returns a status message in JSON format, similar to the
following:
{"status":"ok","version":"v1"}

This result indicates that WebHCat is running and that the HP Vertica host can connect to it and
retrieve a result. If you do not receive this result, troubleshoot your Hadoop installation and the
connectivity between your Hadoop and HP Vertica clusters. For details, see Troubleshooting
HCatalog Connector Problems.
You can also run some queries to verify that WebHCat is correctly configured to work with Hive.
The following example demonstrates listing the databases defined in Hive and the tables defined
within a database:
$ curl http://hcathost:50111/templeton/v1/ddl/database?user.name=hive; echo
{"databases":["default","production"]}
$ curl http://hcathost:50111/templeton/v1/ddl/database/default/table?user.name=hive; echo
{"tables":["messages","weblogs","tweets","transactions"],"database":"default"}

See Apache's WebHCat reference for details about querying Hive using WebHCat.

Installing the Java Runtime on Your HP Vertica Cluster

The HCatalog Connector requires a 64-bit Java Virtual Machine (JVM). The JVM must support
Java 6 or later, and must be the same version as the one installed on your Hadoop nodes.
Note: If your HP Vertica cluster is configured to execute User Defined Extensions (UDxs)
written in Java, it already has a correctly-configured JVM installed. See Developing User
Defined Functions in Java in the Extending HP Vertica Guide for more information.
Installing Java on your HP Vertica cluster is a two-step process:
1. Install a Java runtime on all of the hosts in your cluster.
2. Set the JavaBinaryForUDx configuration parameter to tell HP Vertica the location of the Java
executable.

Installing a Java Runtime


For Java-based features, HP Vertica requires a 64-bit Java 6 (Java version 1.6) or later Java
runtime. HP Vertica supports runtimes from either Oracle or OpenJDK. You can choose to install
either the Java Runtime Environment (JRE) or Java Development Kit (JDK), since the JDK also
includes the JRE.
Many Linux distributions include a package for the OpenJDK runtime. See your Linux distribution's
documentation for information about installing and configuring OpenJDK.
To install the Oracle Java runtime, see the Java Standard Edition (SE) Download Page. You
usually run the installation package as root in order to install it. See the download page for
instructions.
Once you have installed a JVM on each host, ensure that the java command is in the search path
and calls the correct JVM by running the command:
$ java -version

This command should print something similar to:


java version "1.6.0_37"Java(TM) SE Runtime Environment (build 1.6.0_37-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.12-b01, mixed mode)

Note: Any previously installed Java VM on your hosts may interfere with a newly installed
Java runtime. See your Linux distribution's documentation for instructions on configuring which
JVM is the default. Unless absolutely required, you should uninstall any incompatible version
of Java before installing the Java 6 or Java 7 runtime.

Setting the JavaBinaryForUDx Configuration Parameter


The JavaBinaryForUDx configuration parameter tells HP Vertica where to look for the JRE to
execute Java UDxs. After you have installed the JRE on all of the nodes in your cluster, set this
parameter to the absolute path of the Java executable. You can use the symbolic link that some
Java installers create (for example /usr/bin/java). If the Java executable is in your shell search
path, you can get the path of the Java executable by running the following command from the Linux
command line shell:
$ which java
/usr/bin/java

If the java command is not in the shell search path, use the path to the Java executable in the
directory where you installed the JRE. Suppose you installed the JRE in /usr/java/default
(which is where the installation package supplied by Oracle installs the Java 1.6 JRE). In this case
the Java executable is /usr/java/default/bin/java.
You set the configuration parameter by executing the following statement as a database
superuser:
=> ALTER DATABASE mydb SET JavaBinaryForUDx = '/usr/bin/java';

See ALTER DATABASE for more information on setting configuration parameters.


To view the current setting of the configuration parameter, query the CONFIGURATION_
PARAMETERS system table:
=> \x
Expanded display is on.
=> SELECT * FROM CONFIGURATION_PARAMETERS WHERE parameter_name = 'JavaBinaryForUDx';
-[ RECORD 1 ]-----------------+------------------------------------------------------------
node_name                     | ALL
parameter_name                | JavaBinaryForUDx
current_value                 | /usr/bin/java
default_value                 |
change_under_support_guidance | f
change_requires_restart       | f
description                   | Path to the java binary for executing UDx written in Java

Once you have set the configuration parameter, HP Vertica can find the Java executable on each
node in your cluster.
Note: Since the location of the Java executable is set by a single configuration parameter for
the entire cluster, you must ensure that the Java executable is installed in the same path on all
of the hosts in the cluster.

Configuring HP Vertica for HCatalog


Before you can use the HCatalog Connector, you must add certain Hadoop and Hive libraries to
your HP Vertica installation. You must also copy the Hadoop configuration files that specify
various connection properties. HP Vertica uses the values in those configuration files to make its
own connections to Hadoop.
You need only make these changes on one node in your cluster. After you do this you can install the
HCatalog connector.

Copy Hadoop JARs and Configuration Files


HP Vertica provides a tool, hcatUtil, to collect the required files from Hadoop. This tool was
introduced in version 7.1.1. (In a previous version you were required to copy these files manually.)
This tool copies selected jars and XML configuration files.
Note: If you plan to use HIVE to query files that use Snappy compression, you also need
access to the Snappy native libraries. Either include the path(s) for libhadoop*.so and
libsnappy*.so in the path you specify for --hcatLibPath or copy these files to a directory on that
path before beginning.
In order to use this tool you need access to the Hadoop files, which are found on nodes in the
Hadoop cluster. If HP Vertica is not co-located on a Hadoop node, you should do the following:
1. Copy /opt/vertica/packages/hcat/tools/hcatUtil to a Hadoop node and run it there, specifying a
temporary output directory. Your Hadoop, HIVE, and HCatalog lib paths might be different; in
particular, in newer versions of Hadoop the HCatalog directory is usually a subdirectory under
the HIVE directory. Use the values from your environment in the following command:
hcatUtil --copyJars --hadoopHiveHome="/hadoop/lib;/hive/lib;/hcatalog/dist/share"
--hadoopHiveConfPath="/hadoop;/hive;/webhcat"
--hcatLibPath=/tmp/hadoop-files

2. Verify that all necessary files were copied:
   hcatUtil --verifyJars --hcatLibPath=/tmp/hadoop-files

3. Copy that output directory (/tmp/hadoop-files, in this example) to
   /opt/vertica/packages/hcat/lib on the HP Vertica node you will connect to when installing the
   HCatalog connector. If you are updating an HP Vertica cluster to use a new Hadoop cluster (or a
   new version of Hadoop), first remove all JAR files in /opt/vertica/packages/hcat/lib except
   vertica-hcatalogudl.jar.

4. Verify that all necessary files were copied:
   hcatUtil --verifyJars --hcatLibPath=/opt/vertica/packages/hcat

If you are using the HP Vertica for SQL on Hadoop product with co-located clusters, you can do this
in one step on a shared node. Your Hadoop, HIVE, and HCatalog lib paths might be different; use
the values from your environment in the following command:
hcatUtil --copyJars --hadoopHiveHome="/hadoop/lib;/hive/lib;/hcatalog/dist/share"
--hadoopHiveConfPath="/hadoop;/hive;/webhcat"
--hcatLibPath=/opt/vertica/packages/hcat/lib

The hcatUtil script has the following arguments:


-c, --copyJars
    Copy the required JARs from hadoopHiveHome to hcatLibPath.

-v, --verifyJars
    Verify that the required JARs are present in hcatLibPath.

--hadoopHiveHome="value1;value2;..."
    Paths to the Hadoop, Hive, and HCatalog home directories. Separate
    multiple paths by a semicolon (;). Enclose paths in double quotes.
    In newer versions of Hadoop, look for the HCatalog directory under the
    HIVE directory (for example, /hive/hcatalog/share).

--hcatLibPath="value"
    Output path for the lib/ folder of the HCatalog dependency JARs.
    Usually this is /opt/vertica/packages/hcat. You may use any folder, but
    make sure to copy all JARs to the hcat/lib folder before installing the
    HCatalog connector. If you have previously run hcatUtil with a different
    version of Hadoop, remove the old JAR files first (all except
    vertica-hcatalogudl.jar).

--hadoopHiveConfPath="value"
    Paths of the Hadoop, HIVE, and other components' configuration files
    (such as core-site.xml, hive-site.xml, and webhcat-site.xml). Separate
    multiple paths by a semicolon (;). Enclose paths in double quotes.
    These files contain values that would otherwise have to be specified to
    CREATE HCATALOG SCHEMA.
    If you are using Cloudera, or if your HDFS cluster uses Kerberos
    authentication, this parameter is required. Otherwise this parameter is
    optional.
Once you have copied the files and verified them, install the HCatalog connector.

Install the HCatalog Connector


On the same node where you copied the files from hcatUtil, install the HCatalog connector by
running the install.sql script. This script resides in the ddl/ folder under your HCatalog connector
installation path. This script creates the library and VHCatSource and VHCatParser.
Note: The data that was copied using hcatUtil is now stored in the database. If you change any
of those values in Hadoop, you need to rerun hcatUtil and install.sql. The following statement
returns the names of the libraries and configuration files currently being used:
=> SELECT dependencies FROM user_libraries WHERE lib_name='VHCatalogLib';

Now you can create an HCatalog schema whose parameters point to your existing
Hadoop/Hive/WebHCat services, as described in Defining a Schema Using the HCatalog
Connector.

Using the HCatalog Connector with HA NameNode


Newer distributions of Hadoop support the High Availability NameNode (HA NN) for HDFS access.
Some additional configuration is required to use this feature with the HCatalog Connector. If you do
not perform this configuration, attempts to retrieve data through the connector will produce an error.

To use HA NN with HP Vertica, first copy /etc/hadoop/conf from the HDFS cluster to every node in
your HP Vertica cluster. You can put this directory anywhere, but it must be in the same location on
every node. (In the example below it is in /opt/hcat/hadoop_conf.)
Then uninstall the HCat library, configure the UDx to use that configuration directory, and reinstall
the library:
=> \i /opt/vertica/packages/hcat/ddl/uninstall.sql
DROP LIBRARY
=> ALTER DATABASE mydb SET JavaClassPathSuffixForUDx = '/opt/hcat/hadoop_conf';
WARNING 2693: Configuration parameter JavaClassPathSuffixForUDx has been deprecated;
setting it has no effect
=> \i /opt/vertica/packages/hcat/ddl/install.sql
CREATE LIBRARY
CREATE SOURCE FUNCTION
GRANT PRIVILEGE
CREATE PARSER FUNCTION
GRANT PRIVILEGE

Despite the warning message, this step is necessary.


After taking these steps, HCatalog queries will now work.

Defining a Schema Using the HCatalog Connector


After you set up the HCatalog Connector, you can use it to define a schema in your HP Vertica
database to access the tables in a Hive database. You define the schema using the
CREATE HCATALOG SCHEMA statement. See CREATE HCATALOG SCHEMA in the SQL
Reference Manual for a full description.
When creating the schema, you must supply at least two pieces of information:
• the name of the schema to define in HP Vertica

• the host name or IP address of Hive's metastore database (the database server that contains
  metadata about Hive's data, such as the schema and table definitions)

Other parameters are optional. If you do not supply a value, HP Vertica uses default values.
After you define the schema, you can query the data in the Hive data warehouse in the same way
you query a native HP Vertica table. The following example demonstrates creating an HCatalog
schema and then querying several system tables to examine the contents of the new schema. See
Viewing Hive Schema and Table Metadata for more information about these tables.
=> CREATE HCATALOG SCHEMA hcat WITH hostname='hcathost' HCATALOG_SCHEMA='default'
-> HCATALOG_USER='hcatuser';
CREATE SCHEMA
=> -- Show list of all HCatalog schemas
=> \x
Expanded display is on.
=> SELECT * FROM v_catalog.hcatalog_schemata;
-[ RECORD 1 ]--------+------------------------------
schema_id            | 45035996273748980
schema_name          | hcat
schema_owner_id      | 45035996273704962
schema_owner         | dbadmin
create_time          | 2013-11-04 15:09:03.504094-05
hostname             | hcathost
port                 | 9933
webservice_hostname  | hcathost
webservice_port      | 50111
hcatalog_schema_name | default
hcatalog_user_name   | hcatuser
metastore_db_name    | hivemetastoredb
=> -- List the tables in all HCatalog schemas
=> SELECT * FROM v_catalog.hcatalog_table_list;
-[ RECORD 1 ]------+------------------
table_schema_id    | 45035996273748980
table_schema       | hcat
hcatalog_schema    | default
table_name         | messages
hcatalog_user_name | hcatuser
-[ RECORD 2 ]------+------------------
table_schema_id    | 45035996273748980
table_schema       | hcat
hcatalog_schema    | default
table_name         | weblog
hcatalog_user_name | hcatuser
-[ RECORD 3 ]------+------------------
table_schema_id    | 45035996273748980
table_schema       | hcat
hcatalog_schema    | default
table_name         | tweets
hcatalog_user_name | hcatuser

Querying Hive Tables Using HCatalog Connector


Once you have defined the HCatalog schema, you can query data from the Hive database by using
the schema name in your query.
=> SELECT * from hcat.messages limit 10;
 messageid |   userid   |        time         |               message
-----------+------------+---------------------+-------------------------------------
         1 | nPfQ1ayhi  | 2013-10-29 00:10:43 | hymenaeos cursus lorem Suspendis
         2 | N7svORIoZ  | 2013-10-29 00:21:27 | Fusce ad sem vehicula morbi
         3 | 4VvzN3d    | 2013-10-29 00:32:11 | porta Vivamus condimentum
         4 | heojkmTmc  | 2013-10-29 00:42:55 | lectus quis imperdiet
         5 | coROws3OF  | 2013-10-29 00:53:39 | sit eleifend tempus a aliquam mauri
         6 | oDRP1i     | 2013-10-29 01:04:23 | risus facilisis sollicitudin sceler
         7 | AU7a9Kp    | 2013-10-29 01:15:07 | turpis vehicula tortor
         8 | ZJWg185DkZ | 2013-10-29 01:25:51 | sapien adipiscing eget Aliquam tor
         9 | E7ipAsYC3  | 2013-10-29 01:36:35 | varius Cum iaculis metus
        10 | kStCv      | 2013-10-29 01:47:19 | aliquam libero nascetur Cum mal
(10 rows)

Since the tables you access through the HCatalog Connector act like HP Vertica tables, you can
perform operations that use both Hive data and native HP Vertica data, such as a join:
=> SELECT u.FirstName, u.LastName, d.time, d.Message from UserData u
-> JOIN hcat.messages d ON u.UserID = d.UserID LIMIT 10;
 FirstName | LastName |        time         |               Message
-----------+----------+---------------------+-------------------------------------
 Whitney   | Kerr     | 2013-10-29 00:10:43 | hymenaeos cursus lorem Suspendis
 Troy      | Oneal    | 2013-10-29 00:32:11 | porta Vivamus condimentum
 Renee     | Coleman  | 2013-10-29 00:42:55 | lectus quis imperdiet
 Fay       | Moss     | 2013-10-29 00:53:39 | sit eleifend tempus a aliquam mauri
 Dominique | Cabrera  | 2013-10-29 01:15:07 | turpis vehicula tortor
 Mohammad  | Eaton    | 2013-10-29 00:21:27 | Fusce ad sem vehicula morbi
 Cade      | Barr     | 2013-10-29 01:25:51 | sapien adipiscing eget Aliquam tor
 Oprah     | Mcmillan | 2013-10-29 01:36:35 | varius Cum iaculis metus
 Astra     | Sherman  | 2013-10-29 01:58:03 | dignissim odio Pellentesque primis
 Chelsea   | Malone   | 2013-10-29 02:08:47 | pede tempor dignissim Sed luctus
(10 rows)

Viewing Hive Schema and Table Metadata


When using Hive, you access metadata about schemata and tables by executing statements
written in HiveQL (Hive's version of SQL) such as SHOW TABLES. When using the HCatalog
Connector, you can get metadata about the tables in the Hive database through several HP Vertica
system tables.
There are four system tables that contain metadata about the tables accessible through the
HCatalog Connector:
• HCATALOG_SCHEMATA lists all of the schemata (plural of schema) that have been defined
  using the HCatalog Connector. See HCATALOG_SCHEMATA in the SQL Reference Manual
  for detailed information.

• HCATALOG_TABLE_LIST contains an overview of all of the tables available from all schemata
  defined using the HCatalog Connector. This table only shows the tables which the user querying
  the table can access. The information in this table is retrieved using a single call to WebHCat for
  each schema defined using the HCatalog Connector, which means there is a little overhead
  when querying this table. See HCATALOG_TABLE_LIST in the SQL Reference Manual for
  detailed information.

• HCATALOG_TABLES contains more in-depth information than HCATALOG_TABLE_LIST.
  However, querying this table results in HP Vertica making a REST web service call to
  WebHCat for each table available through the HCatalog Connector. If there are many tables in
  the HCatalog schemata, this query could take a while to complete. See HCATALOG_TABLES
  in the SQL Reference Manual for more information.

• HCATALOG_COLUMNS lists metadata about all of the columns in all of the tables available
  through the HCatalog Connector. Similarly to HCATALOG_TABLES, querying this table results
  in one call to WebHCat per table, and therefore can take a while to complete. See
  HCATALOG_COLUMNS in the SQL Reference Manual for more information.

The following example demonstrates querying the system tables containing metadata for the tables
available through the HCatalog Connector.
=> CREATE HCATALOG SCHEMA hcat WITH hostname='hcathost'
-> HCATALOG_SCHEMA='default' HCATALOG_DB='default' HCATALOG_USER='hcatuser';
CREATE SCHEMA
=> SELECT * FROM HCATALOG_SCHEMATA;
-[ RECORD 1 ]--------+------------------------------
schema_id            | 45035996273864536
schema_name          | hcat
schema_owner_id      | 45035996273704962
schema_owner         | dbadmin
create_time          | 2013-11-05 10:19:54.70965-05
hostname             | hcathost
port                 | 9083
webservice_hostname  | hcathost
webservice_port      | 50111
hcatalog_schema_name | default
hcatalog_user_name   | hcatuser
metastore_db_name    | hivemetastoredb
=> SELECT * FROM HCATALOG_TABLE_LIST;
-[ RECORD 1 ]------+------------------
table_schema_id    | 45035996273864536
table_schema       | hcat
hcatalog_schema    | default
table_name         | hcatalogtypes
hcatalog_user_name | hcatuser
-[ RECORD 2 ]------+------------------
table_schema_id    | 45035996273864536
table_schema       | hcat
hcatalog_schema    | default
table_name         | tweets
hcatalog_user_name | hcatuser
-[ RECORD 3 ]------+------------------
table_schema_id    | 45035996273864536
table_schema       | hcat
hcatalog_schema    | default
table_name         | messages
hcatalog_user_name | hcatuser
-[ RECORD 4 ]------+------------------
table_schema_id    | 45035996273864536
table_schema       | hcat
hcatalog_schema    | default
table_name         | msgjson
hcatalog_user_name | hcatuser
=> -- Get detailed description of a specific table
=> SELECT * FROM HCATALOG_TABLES WHERE table_name = 'msgjson';
-[ RECORD 1 ]---------+------------------------------------------------------------
table_schema_id       | 45035996273864536
table_schema          | hcat
hcatalog_schema       | default
table_name            | msgjson
hcatalog_user_name    | hcatuser
min_file_size_bytes   | 13524
total_number_files    | 10
location              | hdfs://hive.example.com:8020/user/exampleuser/msgjson
last_update_time      | 2013-11-05 14:18:07.625-05
output_format         | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
last_access_time      | 2013-11-11 13:21:33.741-05
max_file_size_bytes   | 45762
is_partitioned        | f
partition_expression  |
table_owner           | hcatuser
input_format          | org.apache.hadoop.mapred.TextInputFormat
total_file_size_bytes | 453534
hcatalog_group        | supergroup
permission            | rwxr-xr-x
=> -- Get list of columns in a specific table
=> SELECT * FROM HCATALOG_COLUMNS WHERE table_name = 'hcatalogtypes'
-> ORDER BY ordinal_position;
-[ RECORD 1 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | intcol
hcatalog_data_type       | int
data_type                | int
data_type_id             | 6
data_type_length         | 8
character_maximum_length |
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 1
-[ RECORD 2 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | floatcol
hcatalog_data_type       | float
data_type                | float
data_type_id             | 7
data_type_length         | 8
character_maximum_length |
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 2
-[ RECORD 3 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | doublecol
hcatalog_data_type       | double
data_type                | float
data_type_id             | 7
data_type_length         | 8
character_maximum_length |
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 3
-[ RECORD 4 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | charcol
hcatalog_data_type       | string
data_type                | varchar(65000)
data_type_id             | 9
data_type_length         | 65000
character_maximum_length | 65000
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 4
-[ RECORD 5 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | varcharcol
hcatalog_data_type       | string
data_type                | varchar(65000)
data_type_id             | 9
data_type_length         | 65000
character_maximum_length | 65000
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 5
-[ RECORD 6 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | boolcol
hcatalog_data_type       | boolean
data_type                | boolean
data_type_id             | 5
data_type_length         | 1
character_maximum_length |
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 6
-[ RECORD 7 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | timestampcol
hcatalog_data_type       | string
data_type                | varchar(65000)
data_type_id             | 9
data_type_length         | 65000
character_maximum_length | 65000
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 7
-[ RECORD 8 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | varbincol
hcatalog_data_type       | binary
data_type                | varbinary(65000)
data_type_id             | 17
data_type_length         | 65000
character_maximum_length | 65000
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 8
-[ RECORD 9 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | bincol
hcatalog_data_type       | binary
data_type                | varbinary(65000)
data_type_id             | 17
data_type_length         | 65000
character_maximum_length | 65000
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 9

Synching an HCatalog Schema With a Local Schema

Querying data from an HCatalog schema can be slow due to Hive and WebHCat performance
issues. This slow performance can be especially annoying when you use the HCatalog Connector
to query the HCatalog schema's metadata to examine the structure of the tables in the Hive
database.
To avoid this problem you can use the SYNC_WITH_HCATALOG_SCHEMA function to create a
snapshot of the HCatalog schema's metadata within an HP Vertica schema. You supply this
function with the name of a pre-existing HP Vertica schema and an HCatalog schema available
through the HCatalog Connector. It creates a set of external tables within the HP Vertica schema
that you can then use to examine the structure of the tables in the Hive database. Because the
metadata in the HP Vertica schema is local, query planning is much faster. You can also use
standard HP Vertica statements and system tables queries to examine the structure of Hive tables
in the HCatalog schema.
Caution: The SYNC_WITH_HCATALOG_SCHEMA function overwrites tables in the HP
Vertica schema whose names match a table in the HCatalog schema. To avoid losing data,
always create an empty HP Vertica schema to sync with an HCatalog schema.
The HP Vertica schema is just a snapshot of the HCatalog schema's metadata. HP Vertica does
not synchronize later changes to the HCatalog schema with the local schema after you call SYNC_
WITH_HCATALOG_SCHEMA. You can call the function again to re-synchronize the local schema
to the HCatalog schema.
Note: By default, the function does not drop tables that appear in the local schema that do not
appear in the HCatalog schema. Thus after the function call the local schema does not reflect
tables that have been dropped in the Hive database. You can change this behavior by
supplying the optional third Boolean argument that tells the function to drop any table in the
local schema that does not correspond to a table in the HCatalog schema.
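For example, to also drop local tables whose Hive counterparts no longer exist, you can pass the
optional third argument. This is a sketch based on the behavior described above; the argument is a
Boolean:
=> SELECT SYNC_WITH_HCATALOG_SCHEMA('hcat_local', 'hcat', true);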


The following example demonstrates calling SYNC_WITH_HCATALOG_SCHEMA to sync the
HCatalog schema named hcat with a local schema.
=> CREATE SCHEMA hcat_local;
CREATE SCHEMA
=> SELECT sync_with_hcatalog_schema('hcat_local', 'hcat');
          sync_with_hcatalog_schema
----------------------------------------------
 Schema hcat_local synchronized with hcat
 tables in hcat = 56
 tables altered in hcat_local = 0
 tables created in hcat_local = 56
 stale tables in hcat_local = 0
 table changes erred in hcat_local = 0
(1 row)
=> -- Use vsql's \d command to describe a table in the synced schema
=> \d hcat_local.messages
                                    List of Fields by Tables
   Schema   |  Table   | Column  |      Type      | Size  | Default | Not Null | Primary Key | Foreign Key
------------+----------+---------+----------------+-------+---------+----------+-------------+-------------
 hcat_local | messages | id      | int            |     8 |         | f        | f           |
 hcat_local | messages | userid  | varchar(65000) | 65000 |         | f        | f           |
 hcat_local | messages | "time"  | varchar(65000) | 65000 |         | f        | f           |
 hcat_local | messages | message | varchar(65000) | 65000 |         | f        | f           |
(4 rows)

Note: You can query tables in the local schema you synched with an HCatalog schema.
Querying tables in a synched schema isn't much faster than directly querying the HCatalog
schema because SYNC_WITH_HCATALOG_SCHEMA only duplicates the HCatalog
schema's metadata. The data in the table is still retrieved using the HCatalog Connector.

Data Type Conversions from Hive to HP Vertica


The data types recognized by Hive differ from the data types recognized by HP Vertica. The
following table lists how the HCatalog Connector converts Hive data types into data types
compatible with HP Vertica.
Hive Data Type        Vertica Data Type
--------------------  ---------------------------------------------------------------------
TINYINT (1-byte)      TINYINT (8-bytes)
SMALLINT (2-bytes)    SMALLINT (8-bytes)
INT (4-bytes)         INT (8-bytes)
BIGINT (8-bytes)      BIGINT (8-bytes)
BOOLEAN               BOOLEAN
FLOAT (4-bytes)       FLOAT (8-bytes)
DOUBLE (8-bytes)      DOUBLE PRECISION (8-bytes)
STRING (2 GB max)     VARCHAR (65000)
BINARY (2 GB max)     VARBINARY (65000)
LIST/ARRAY            VARCHAR (65000) containing a JSON-format representation of the list
MAP                   VARCHAR (65000) containing a JSON-format representation of the map
STRUCT                VARCHAR (65000) containing a JSON-format representation of the struct

Data-Width Handling Differences Between Hive and HP Vertica

The HCatalog Connector relies on Hive SerDe classes to extract data from files on HDFS.
Therefore, the data read from these files is subject to Hive's data width restrictions. For example,
suppose the SerDe parses a value for an INT column into a value that is greater than 2^32-1 (the
maximum value for a 32-bit integer). In this case, the value is rejected even though it would fit into
an HP Vertica 64-bit INTEGER column, because it cannot fit into Hive's 32-bit INT.
Once the value has been parsed and converted to an HP Vertica data type, it is treated as native
data. This treatment can result in some confusion when comparing the results of an identical query
run in Hive and in HP Vertica. For example, if your query adds two INT values that produce a result
larger than 2^32-1, the value overflows its 32-bit INT data type, causing Hive to return an error.
When running the same query with the same data in HP Vertica using the HCatalog Connector, the
value probably still fits within HP Vertica's 64-bit INTEGER. Thus the addition succeeds and
returns a value.
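As an illustrative sketch (assuming the hcatalogtypes table shown earlier, whose intcol column is a
Hive INT), the same addition can fail in Hive with an overflow but succeed through the HCatalog
Connector, because HP Vertica evaluates it as a 64-bit INTEGER:
=> SELECT intcol + intcol AS doubled FROM hcat.hcatalogtypes LIMIT 1;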

Using Non-Standard SerDes


Hive stores its data in unstructured flat files located in the Hadoop Distributed File System (HDFS).
When you execute a Hive query, it uses a set of serializer and deserializer (SerDe) classes to
extract data from these flat files and organize it into a relational database table. For Hive to be able
to extract data from a file, it must have a SerDe that can parse the data the file contains. When you
create a table in Hive, you can select the SerDe to be used for the table's data.
Hive has a set of standard SerDes that handle data in several formats such as delimited data and
data extracted using regular expressions. You can also use third-party or custom-defined SerDes
that allow Hive to process data stored in other file formats. For example, some commonly-used
third-party SerDes handle data stored in JSON format.
The HCatalog Connector directly fetches file segments from HDFS and uses Hive's
SerDe classes to extract data from them. The Connector includes all of Hive's standard SerDe
classes, so it can process data stored in any file that Hive natively supports. If you want to query
data from a Hive table that uses a custom SerDe, you must first install the SerDe classes on the
HP Vertica cluster.

Determining Which SerDe You Need


If you have access to the Hive command line, you can determine which SerDe a table uses by
using Hive's SHOW CREATE TABLE statement. This statement shows the HiveQL statement
needed to recreate the table. For example:
hive> SHOW CREATE TABLE msgjson;
OK
CREATE EXTERNAL TABLE msgjson(
messageid int COMMENT 'from deserializer',
userid string COMMENT 'from deserializer',
time string COMMENT 'from deserializer',
message string COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://hivehost.example.com:8020/user/exampleuser/msgjson'
TBLPROPERTIES (
'transient_lastDdlTime'='1384194521')
Time taken: 0.167 seconds

In the example, ROW FORMAT SERDE indicates that a special SerDe is used to parse the data
files. The next row shows that the class for the SerDe is named
org.apache.hadoop.hive.contrib.serde2.JsonSerde. You must provide the HCatalog
Connector with a copy of this SerDe class so that it can read the data from this table.
You can also find out which SerDe class you need by querying the table that uses the custom
SerDe. The query fails with an error message that contains the class name of the SerDe needed
to parse the data in the table. In the following example, the missing SerDe class named in the error
message is com.cloudera.hive.serde.JSONSerDe.
=> SELECT * FROM hcat.jsontable;
ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error in User Defined
Object [VHCatSource], error code: 0
com.vertica.sdk.UdfException: Error message is [
org.apache.hcatalog.common.HCatException : 2004 : HCatOutputFormat not
initialized, setOutput has to be called. Cause : java.io.IOException:
java.lang.RuntimeException:
MetaException(message:org.apache.hadoop.hive.serde2.SerDeException
SerDe com.cloudera.hive.serde.JSONSerDe does not exist) ] HINT If error
message is not descriptive or local, may be we cannot read metadata from hive
metastore service thrift://hcathost:9083 or HDFS namenode (check
UDxLogs/UDxFencedProcessesJava.log in the catalog directory for more information)
at com.vertica.hcatalogudl.HCatalogSplitsNoOpSourceFactory
.plan(HCatalogSplitsNoOpSourceFactory.java:98)
at com.vertica.udxfence.UDxExecContext.planUDSource(UDxExecContext.java:898)
. . .

Installing the SerDe on the HP Vertica Cluster


You usually have two options for getting the SerDe class files the HCatalog Connector needs:

• Find the installation files for the SerDe, then copy those over to your HP Vertica cluster. For
  example, there are several third-party JSON SerDes available from sites like Google Code and
  GitHub. You may find the one that matches the file installed on your Hive cluster. If so, then
  download the package and copy it to your HP Vertica cluster.

• Directly copy the JAR files from a Hive server onto your HP Vertica cluster. The location for the
  SerDe JAR files depends on your Hive installation. On some systems, they may be located in
  /usr/lib/hive/lib.

Wherever you get the files, copy them into the /opt/vertica/packages/hcat/lib directory on
every node in your HP Vertica cluster.
Important: If you add a new host to your HP Vertica cluster, remember to copy every custom
SerDe JAR file to it.

Troubleshooting HCatalog Connector Problems


You may encounter the following issues when using the HCatalog Connector.

Connection Errors
When you use CREATE HCATALOG SCHEMA to create a new schema, the HCatalog Connector
does not immediately attempt to connect to the WebHCat or metastore servers. Instead, when you
execute a query using the schema or HCatalog-related system tables, the connector attempts to
connect to and retrieve data from your Hadoop cluster.
The types of errors you get depend on which parameters are incorrect. Suppose you have incorrect
parameters for the metastore database, but correct parameters for WebHCat. In this case,
HCatalog-related system table queries succeed, while queries on the HCatalog schema fail. The
following example demonstrates creating an HCatalog schema with the correct default WebHCat
information. However, the port number for the metastore database is incorrect.
=> CREATE HCATALOG SCHEMA hcat2 WITH hostname='hcathost'
-> HCATALOG_SCHEMA='default' HCATALOG_USER='hive' PORT=1234;
CREATE SCHEMA
=> SELECT * FROM HCATALOG_TABLE_LIST;
-[ RECORD 1 ]------+--------------------table_schema_id
| 45035996273864536
table_schema
| hcat2
hcatalog_schema
| default
table_name
| test
hcatalog_user_name | hive
=> SELECT * FROM hcat2.test;
ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error in User Defined
Object [VHCatSource], error code: 0
com.vertica.sdk.UdfException: Error message is [
org.apache.hcatalog.common.HCatException : 2004 : HCatOutputFormat not
initialized, setOutput has to be called. Cause : java.io.IOException:
MetaException(message:Could not connect to meta store using any of the URIs
provided. Most recent failure: org.apache.thrift.transport.TTransportException:
java.net.ConnectException:
Connection refused
at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(
HiveMetaStoreClient.java:277)
. . .

To resolve these issues, you must drop the schema and recreate it with the correct parameters. If
you still have issues, determine whether there are connectivity issues between your HP Vertica
cluster and your Hadoop cluster. Such issues can include a firewall that prevents one or more HP
Vertica hosts from contacting the WebHCat, metastore, or HDFS hosts.
You may also see this error if you are using HA NameNode, particularly with larger tables that
HDFS splits into multiple blocks. See Using the HCatalog Connector with HA NameNode for more
information about correcting this problem.

UDx Failure When Querying Data: Error 3399


You might see an error when querying data (as opposed to metadata like schema information). This
can happen for the following reasons:
• You are not using the same version of Java on your Hadoop and HP Vertica nodes. In this case
  you need to change one of them to match the other.

• You have not used hcatUtil to copy the Hadoop and Hive libraries to HP Vertica.

• You copied the libraries but they no longer match the versions of Hive and Hadoop that you are
  using.

• The version of Hadoop you are using relies on a third-party library that you must copy manually.

If you did not copy the libraries, follow the instructions in Configuring HP Vertica for HCatalog.
If the Hive JARs that you copied from Hadoop are out of date, you might see an error message
like the following:
ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error in User Defined Object
[VHCatSource], error code: 0 Error message is
[ Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected ]
HINT hive metastore service is thrift://localhost:13433 (check
UDxLogs/UDxFencedProcessesJava.log in the catalog directory for more information)

This usually signals a problem with the hive-hcatalog-core JAR. Make sure you have an up-to-date
copy of this file. Remember that if you rerun hcatUtil you also need to re-create the HCatalog schema.
You might also see a different form of this error:
ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error in User Defined Object
[VHCatSource],
error code: 0 Error message is [ javax/servlet/Filter ]

This error can be reported even if hcatUtil reports that your libraries are up to date. The
javax.servlet.Filter class is in a library that some versions of Hadoop use but that is not usually part
of the Hadoop installation directly. If you see an error mentioning this class, locate servlet-api-*.jar
on a Hadoop node and copy it to the hcat/lib directory on all database nodes. If you cannot locate it
on a Hadoop node, locate and download it from the Internet. (This case is rare.) The library version
must be 2.3 or higher.

Once you have copied the jar to the hcat/lib directory, reinstall the HCatalog connector as explained
in Configuring HP Vertica for HCatalog.

SerDe Errors
Errors can occur if you attempt to query a Hive table that uses a non-standard SerDe. If you have not installed the SerDe JAR files on your HP Vertica cluster, you receive an error similar to the one in the following example:
=> SELECT * FROM hcat.jsontable;
ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error in User Defined
Object [VHCatSource], error code: 0
com.vertica.sdk.UdfException: Error message is [
org.apache.hcatalog.common.HCatException : 2004 : HCatOutputFormat not
initialized, setOutput has to be called. Cause : java.io.IOException:
java.lang.RuntimeException:
MetaException(message:org.apache.hadoop.hive.serde2.SerDeException
SerDe com.cloudera.hive.serde.JSONSerDe does not exist) ] HINT If error
message is not descriptive or local, may be we cannot read metadata from hive
metastore service thrift://hcathost:9083 or HDFS namenode (check
UDxLogs/UDxFencedProcessesJava.log in the catalog directory for more information)
at com.vertica.hcatalogudl.HCatalogSplitsNoOpSourceFactory
.plan(HCatalogSplitsNoOpSourceFactory.java:98)
at com.vertica.udxfence.UDxExecContext.planUDSource(UDxExecContext.java:898)
. . .

In the error message, you can see that the root cause is a missing SerDe class (com.cloudera.hive.serde.JSONSerDe in this example). To resolve this issue, install the SerDe class on your HP Vertica cluster. See Using Non-Standard SerDes for more information.
This error may occur intermittently if just one or a few hosts in your cluster do not have the SerDe class installed.
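To find out which SerDe class a Hive table declares (and therefore which JAR you need to install), you can inspect the table from the Hive side. The database and table names below are placeholders; DESCRIBE FORMATTED prints a "SerDe Library" row that identifies the class.

$ hive -e "USE default; DESCRIBE FORMATTED jsontable;" | grep -i serde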

Differing Results Between Hive and HP Vertica Queries


Sometimes, running the same query on Hive and on HP Vertica through the HCatalog Connector
can return different results. This discrepancy is often caused by the differences between the data
types supported by Hive and HP Vertica. See Data Type Conversions from Hive to HP Vertica for
more information about supported data types.

Preventing Excessive Query Delays


Network issues or high system loads on the WebHCat server can cause long delays while querying a Hive database using the HCatalog Connector. While HP Vertica cannot resolve these issues, you can set parameters that limit how long HP Vertica waits before canceling a query on an HCatalog schema. You can set these parameters globally using HP Vertica configuration parameters. You can also set them for specific HCatalog schemas in the CREATE HCATALOG SCHEMA statement. These schema-specific settings override the settings in the configuration parameters.
The HCatConnectionTimeout configuration parameter and the CREATE HCATALOG SCHEMA statement's HCATALOG_CONNECTION_TIMEOUT parameter control how many seconds the HCatalog Connector waits for a connection to the WebHCat server. A value of 0 (the default setting for the configuration parameter) means to wait indefinitely. If the WebHCat server does not respond by the time this timeout elapses, the HCatalog Connector breaks the connection and cancels the query. If you find that some queries on an HCatalog schema pause excessively, try setting this parameter to a timeout value, so the query does not hang indefinitely.
The HCatSlowTransferTime configuration parameter and the CREATE HCATALOG SCHEMA statement's HCATALOG_SLOW_TRANSFER_TIME parameter specify how long the HCatalog Connector waits for data after making a successful connection to the WebHCat server. After the specified time has elapsed, the HCatalog Connector determines whether the data transfer rate from the WebHCat server is at least the value set in the HCatSlowTransferLimit configuration parameter (or by the CREATE HCATALOG SCHEMA statement's HCATALOG_SLOW_TRANSFER_LIMIT parameter). If it is not, then the HCatalog Connector terminates the connection and cancels the query.
You can set these parameters to cancel queries that run very slowly but do eventually complete. However, query delays are usually caused by a slow connection rather than a problem establishing the connection. Therefore, try adjusting the slow transfer rate settings first. If you find the cause of the issue is connections that never complete, you can alternatively adjust the Linux TCP socket timeouts to a suitable value instead of relying solely on the HCatConnectionTimeout parameter.
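As an illustration only, the statements below show both forms. The timeout and transfer-limit values are arbitrary, mydb is a placeholder database name, and the connection options in the CREATE HCATALOG SCHEMA statement (host name, Hive schema, user) are placeholders; check the CREATE HCATALOG SCHEMA reference for the exact option list your version accepts.

=> ALTER DATABASE mydb SET HCatConnectionTimeout = 30;
=> ALTER DATABASE mydb SET HCatSlowTransferTime = 30;
=> ALTER DATABASE mydb SET HCatSlowTransferLimit = 10240;

=> CREATE HCATALOG SCHEMA hcat_limited WITH HOSTNAME='webhcat.example.com'
   HCATALOG_SCHEMA='default' HCATALOG_USER='hive'
   HCATALOG_CONNECTION_TIMEOUT=30
   HCATALOG_SLOW_TRANSFER_TIME=30
   HCATALOG_SLOW_TRANSFER_LIMIT=10240;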

Using the HP Vertica Storage Location for HDFS
The HP Vertica Storage Location for HDFS lets HP Vertica store its data in a Hadoop Distributed File System (HDFS) similarly to how it stores data on a native Linux filesystem. It lets you create a storage tier for lower-priority data to free space on your HP Vertica cluster for higher-priority data.
For example, suppose you store website clickstream data in your HP Vertica database. You may find that most queries only examine the last six months of this data. However, there are a few low-priority queries that still examine data older than six months. In this case, you could choose to move the older data to an HDFS storage location so that it is still available for the infrequent queries. The queries on the older data are slower because they now access data stored on HDFS rather than native disks. However, you free space on your HP Vertica cluster's storage for higher-priority, frequently queried data.

Storage Location for HDFS Requirements


To store HP Vertica's data on HDFS, verify that:

- Your Hadoop cluster has WebHDFS enabled.
- All of the nodes in your HP Vertica cluster can connect to all of the nodes in your Hadoop cluster. Any firewall between the two clusters must allow connections on the ports used by HDFS. See Testing Your Hadoop WebHDFS Configuration for a procedure to test the connectivity between your HP Vertica and Hadoop clusters. (A quick manual check with curl is sketched after this list.)
- You have a Hadoop user whose username matches the name of the HP Vertica database administrator (usually named dbadmin). This Hadoop user must have read and write access to the HDFS directory where you want HP Vertica to store its data.
- Your HDFS has enough storage available for HP Vertica data. See HDFS Space Requirements below for details.
- The data you store in an HDFS-backed storage location does not expand your database's size beyond any data allowance in your HP Vertica license. HP Vertica counts data stored in an HDFS-backed storage location as part of any data allowance set by your license. See Managing Licenses in the Administrator's Guide for more information.
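For a quick manual connectivity check from an HP Vertica host, you can issue a WebHDFS REST call with curl. The host name, port, path, and user below are placeholders for your own cluster; a 200 response with a JSON FileStatus object indicates that WebHDFS is reachable, while a connection error or timeout points to a network or firewall problem.

$ curl -i "http://hadoop.example.com:50070/webhdfs/v1/user/dbadmin?op=GETFILESTATUS&user.name=dbadmin"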


HDFS Space Requirements


If your HP Vertica database is K-safe, HDFS-based storage locations contain two copies of the data you store in them. One copy is the primary projection, and the other is the buddy projection. If you have enabled HDFS's data redundancy feature, Hadoop stores both projections multiple times. This duplication may seem excessive. However, it is similar to how a RAID level 1 or higher array redundantly stores copies of both HP Vertica's primary and buddy projections. The redundant copies also help the performance of HDFS by enabling multiple nodes to process a request for a file.
Verify that your HDFS installation has sufficient space available for redundant storage of both the primary and buddy projections of your K-safe data. You can adjust the number of duplicates stored by HDFS by setting the HadoopFSReplication configuration parameter. See Troubleshooting HDFS Storage Locations for details.
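For example, the following statement sets the parameter for the whole database. The replication factor shown is purely illustrative and mydb is a placeholder database name; as noted in Setting Configuration Parameters later in this section, change this value only when you have a specific reason to.

=> ALTER DATABASE mydb SET HadoopFSReplication = 2;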

Additional Requirements for Backing Up Data Stored on HDFS

In Enterprise Edition, to back up your data stored in HDFS storage locations, your Hadoop cluster must:

- Have HDFS 2.0 or later installed. The vbr.py backup utility uses the snapshot feature introduced in HDFS 2.0.
- Have snapshotting enabled for the directories to be used for backups. The easiest way to do this is to give the database administrator's account superuser privileges in Hadoop, so that snapshotting can be set automatically. Alternatively, use Hadoop to enable snapshotting for each directory before using it for backups.

In addition, your HP Vertica database must:

- Have enough Hadoop components and libraries installed in order to run the Hadoop distcp command as the HP Vertica database-administrator user (usually dbadmin).
- Have the JavaBinaryForUDx and HadoopHome configuration parameters set correctly.

Caution: After you have created an HDFS storage location, full database backups will fail with the error message:

ERROR 5127: Unable to create snapshot No such file /usr/bin/hadoop:
check the HadoopHome configuration parameter

This error occurs because the backup script cannot back up the HDFS storage locations. You must configure HP Vertica and Hadoop to enable the backup script to back up these locations. After you configure HP Vertica and Hadoop, you can once again perform full database backups.

See Backing Up HDFS Storage Locations for details on configuring your HP Vertica and Hadoop clusters to enable HDFS storage location backup.

How the HDFS Storage Location Stores Data

The HP Vertica Storage Location for HDFS stores data on the Hadoop HDFS similarly to the way HP Vertica stores data in the Linux file system. See Managing Storage Locations in the Administrator's Guide for more information about storage locations. When you create a storage location on HDFS, HP Vertica stores the ROS containers holding its data on HDFS. You can choose which data uses the HDFS storage location: from the data for just a single table to all of the database's data.
When HP Vertica reads data from or writes data to an HDFS storage location, the node storing or retrieving the data contacts the Hadoop cluster directly to transfer the data. If a single ROS container file is split among several Hadoop nodes, the HP Vertica node connects to each of them. The HP Vertica node retrieves the pieces and reassembles the file. Because each node fetches its own data directly from the source, data transfers are parallel, increasing their efficiency. Having the HP Vertica nodes directly retrieve the file splits also reduces the impact on the Hadoop cluster.

What You Can Store on HDFS

Use HDFS storage locations to store only data. You cannot store catalog information in an HDFS storage location.
Caution: While it is possible to use an HDFS storage location for temporary data storage, you must never do so. Using HDFS for temporary storage causes severe performance issues. The only time you should change an HDFS storage location's usage to temporary is when you are in the process of removing it.

What HDFS Storage Locations Cannot Do

Because HP Vertica uses the storage locations to store ROS containers in a proprietary format, MapReduce and other Hadoop components cannot access your HP Vertica data stored in HDFS.


Never allow another program that has access to HDFS to write to the ROS files. Any outside modification of these files can lead to data corruption and loss.
Use the HP Vertica Connector for Hadoop MapReduce if you need your MapReduce job to access HP Vertica data. Other applications must use the HP Vertica client libraries to access HP Vertica data.
The storage location stores and reads only ROS containers. It cannot read data stored in native formats in HDFS. If you want HP Vertica to read data from HDFS, use the HP Vertica Connector for HDFS. If the data you want to access is available in a Hive database, you can use the HP Vertica Connector for HCatalog.

Creating an HDFS Storage Location


Before creating an HDFS storage location, you must first create a Hadoop user who can access the data:

- If your HDFS cluster is unsecured, create a Hadoop user whose username matches the user name of the HP Vertica database administrator account. For example, suppose your database administrator account has the default username dbadmin. You must create a Hadoop user account named dbadmin and give it full read and write access to the directory on HDFS where you want to store files.
- If your HDFS cluster uses Kerberos authentication, create a Kerberos principal for HP Vertica and give it read and write access to the HDFS directory that will be used for the storage location. See Configuring Kerberos.

Consult the documentation for your Hadoop distribution to learn how to create a user and grant the user read and write permissions for a directory in HDFS. (A shell sketch for an unsecured cluster follows.)
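On an unsecured cluster, the setup usually amounts to something like the following, run on a Hadoop node. The directory path is a placeholder, the exact commands vary by distribution, and whether a matching OS account is needed at all depends on how your Hadoop cluster maps users.

# Create a matching OS user on the Hadoop node if one does not already exist.
$ sudo useradd dbadmin

# Create the HDFS directory and give the dbadmin user full access to it.
$ sudo -u hdfs hdfs dfs -mkdir -p /user/dbadmin
$ sudo -u hdfs hdfs dfs -chown dbadmin:dbadmin /user/dbadmin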
Use the CREATE LOCATION statement to create an HDFS storage location. To do so, you must:

- Supply the WebHDFS URI for the HDFS directory where you want HP Vertica to store the location's data as the path argument. This URI is the same as a standard HDFS URL, except it uses the webhdfs:// protocol and its path does not start with /webhdfs/v1/.
- Include the ALL NODES SHARED keywords, as all HDFS storage locations are shared storage. This is required even if you have only one HDFS node in your cluster.

The following example demonstrates creating an HDFS storage location that:


- Is located on the Hadoop cluster whose name node's host name is hadoop.
- Stores its files in the /user/dbadmin directory.
- Is labeled coldstorage.

The example also demonstrates querying the STORAGE_LOCATIONS system table to verify that the storage location was created.
=> CREATE LOCATION 'webhdfs://hadoop:50070/user/dbadmin' ALL NODES SHARED
   USAGE 'data' LABEL 'coldstorage';
CREATE LOCATION
=> SELECT node_name, location_path, location_label FROM STORAGE_LOCATIONS;
    node_name     |                    location_path                      | location_label
------------------+-------------------------------------------------------+----------------
 v_vmart_node0001 | /home/dbadmin/VMart/v_vmart_node0001_data             |
 v_vmart_node0001 | webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001  | coldstorage
 v_vmart_node0002 | /home/dbadmin/VMart/v_vmart_node0002_data             |
 v_vmart_node0002 | webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002  | coldstorage
 v_vmart_node0003 | /home/dbadmin/VMart/v_vmart_node0003_data             |
 v_vmart_node0003 | webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003  | coldstorage
(6 rows)

Each node in the cluster has created its own directory under the dbadmin directory in HDFS. These
individual directories prevent the nodes from interfering with each other's files in the shared
location.

Creating a Storage Location Using HP Vertica for SQL on Hadoop

If you are using the Enterprise Edition product, then you typically use HDFS storage locations for lower-priority data as shown in the previous example. If you are using the HP Vertica for SQL on Hadoop product, however, all of your data must be stored in HDFS.
To create an HDFS storage location that complies with the HP Vertica for SQL on Hadoop license, first create the location on all nodes and then set its storage policy to HDFS. To create the location in HDFS on all nodes:
=> CREATE LOCATION 'webhdfs://hadoop:50070/user/dbadmin' ALL NODES SHARED
   USAGE 'data' LABEL 'HDFS';

Next, set the storage policy for your database to use this location:
=> SELECT set_object_storage_policy('DBNAME', 'HDFS');


This causes all data to be written to the HDFS storage location instead of the local disk.

Adding HDFS Storage Locations to New Nodes


Any nodes you add to your cluster do not have access to existing HDFS storage locations. You must manually create the storage location for the new node using the CREATE LOCATION statement. Do not use the ALL NODES keyword in this statement. Instead, use the NODE keyword with the name of the new node to tell HP Vertica that just that node needs to add the shared location.
Caution: You must manually create the storage location. Otherwise, the new node uses the default storage policy (usually, storage on the local Linux filesystem) to store data that the other nodes store in HDFS. As a result, the node can run out of disk space.
The following example shows how to add the storage location from the preceding example to a new node named v_vmart_node0004:
=> CREATE LOCATION 'webhdfs://hadoop:50070/user/dbadmin' NODE 'v_vmart_node0004'
   SHARED USAGE 'data' LABEL 'coldstorage';

Any standby nodes that are active in your cluster when you create an HDFS-based storage location automatically create their own instances of the location. When a standby node takes over for a down node, it uses its own instance of the location to store data for objects using the HDFS-based storage policy. Treat standby nodes added after you create the storage location as any other new node: you must manually define the HDFS storage location for them.

Creating a Storage Policy for HDFS Storage Locations

After you create an HDFS storage location, you assign database objects to the location by setting storage policies. Based on these storage policies, database objects such as partition ranges, individual tables, whole schemata, or even the entire database store their data in the HDFS storage location. Use the SET_OBJECT_STORAGE_POLICY function to assign objects to an HDFS storage location. In the function call, supply the label that you assigned to the HDFS storage location (using the CREATE LOCATION statement's LABEL keyword) as the location label argument.
The following topics provide examples of storing data on HDFS.


Storing an Entire Table in an HDFS Storage Location


The following example demonstrates using SET_OBJECT_STORAGE_POLICY to store a table in an HDFS storage location. The example statement sets the policy for an existing table, named messages, to store its data in an HDFS storage location named coldstorage.
=> SELECT SET_OBJECT_STORAGE_POLICY('messages', 'coldstorage');

This table's data is moved to the HDFS storage location with the next merge-out. Alternatively, you
can have HP Vertica move the data immediately by using the enforce_storage_move parameter.
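As a sketch only, a call of the following form (using the boolean third argument, the same form used later in Moving Data to Another Storage Location) forces the move as part of the call:

=> SELECT SET_OBJECT_STORAGE_POLICY('messages', 'coldstorage', true);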
You can query the STORAGE_CONTAINERS system table and examine the location_label column to verify that HP Vertica has moved the data:
=> SELECT node_name, projection_name, location_label, total_row_count
   FROM V_MONITOR.STORAGE_CONTAINERS WHERE projection_name ILIKE 'messages%';
    node_name     | projection_name | location_label | total_row_count
------------------+-----------------+----------------+-----------------
 v_vmart_node0001 | messages_b0     | coldstorage    |          366057
 v_vmart_node0001 | messages_b1     | coldstorage    |          366511
 v_vmart_node0002 | messages_b0     | coldstorage    |          367432
 v_vmart_node0002 | messages_b1     | coldstorage    |          366057
 v_vmart_node0003 | messages_b0     | coldstorage    |          366511
 v_vmart_node0003 | messages_b1     | coldstorage    |          367432
(6 rows)

See Creating Storage Policies in the Administrator's Guide for more information about assigning
storage policies to objects.

Storing Table Partitions in HDFS


If the data you want to store in an HDFS-based storage location is in a partitioned table, you can choose to store some of the partitions in HDFS. This capability lets you periodically move old data that is queried less frequently off of more costly, higher-speed storage (such as a solid-state drive). You can instead use slower and less expensive HDFS storage. The older data is still accessible in queries, just at a slower speed. In this scenario, the faster storage is often referred to as "hot storage," and the slower storage is referred to as "cold storage."
For example, suppose you have a table named messages containing social media messages that is partitioned by the year and month of the message's timestamp. You can list the partitions in the table by querying the PARTITIONS system table.
=> SELECT partition_key, projection_name, node_name, location_label FROM partitions
   ORDER BY partition_key;
 partition_key | projection_name |    node_name     | location_label
---------------+-----------------+------------------+----------------
 201309        | messages_b1     | v_vmart_node0001 |
 201309        | messages_b0     | v_vmart_node0003 |
 201309        | messages_b1     | v_vmart_node0002 |
 201309        | messages_b1     | v_vmart_node0003 |
 201309        | messages_b0     | v_vmart_node0001 |
 201309        | messages_b0     | v_vmart_node0002 |
 201310        | messages_b0     | v_vmart_node0002 |
 201310        | messages_b1     | v_vmart_node0003 |
 201310        | messages_b0     | v_vmart_node0001 |
 . . .
 201405        | messages_b0     | v_vmart_node0002 |
 201405        | messages_b1     | v_vmart_node0003 |
 201405        | messages_b1     | v_vmart_node0001 |
 201405        | messages_b0     | v_vmart_node0001 |
(54 rows)

Next, suppose you find that most queries on this table access only the latest month or two of data.
You may decide to move the older data to cold storage in an HDFS-based storage location. After
you move the data, it is still available for queries, but with lower query performance.
To move partitions to the HDFS storage location, supply the lowest and highest partition key values to be moved in the SET_OBJECT_STORAGE_POLICY function call. The following example shows how to move data between two dates to an HDFS-based storage location. In this example:

- Partition key value 201309 represents September 2013.
- Partition key value 201403 represents March 2014.
- The name coldstorage is the label of the HDFS-based storage location.

=> SELECT SET_OBJECT_STORAGE_POLICY('messages','coldstorage', '201309', '201403'
   USING PARAMETERS ENFORCE_STORAGE_MOVE = 'true');

After the statement finishes, the range of partitions appears in the HDFS storage location labeled coldstorage. This location name now displays in the PARTITIONS system table's location_label column.
=> SELECT partition_key, projection_name, node_name, location_label
   FROM partitions ORDER BY partition_key;
 partition_key | projection_name |    node_name     | location_label
---------------+-----------------+------------------+----------------
 201309        | messages_b0     | v_vmart_node0003 | coldstorage
 201309        | messages_b1     | v_vmart_node0001 | coldstorage
 201309        | messages_b1     | v_vmart_node0002 | coldstorage
 201309        | messages_b0     | v_vmart_node0001 | coldstorage
 . . .
 201403        | messages_b0     | v_vmart_node0002 | coldstorage
 201404        | messages_b0     | v_vmart_node0001 |
 201404        | messages_b0     | v_vmart_node0002 |
 201404        | messages_b1     | v_vmart_node0001 |
 201404        | messages_b1     | v_vmart_node0002 |
 201404        | messages_b0     | v_vmart_node0003 |
 201404        | messages_b1     | v_vmart_node0003 |
 201405        | messages_b0     | v_vmart_node0001 |
 201405        | messages_b1     | v_vmart_node0002 |
 201405        | messages_b0     | v_vmart_node0002 |
 201405        | messages_b0     | v_vmart_node0003 |
 201405        | messages_b1     | v_vmart_node0001 |
 201405        | messages_b1     | v_vmart_node0003 |
(54 rows)
After your initial data move, you can move additional data to the HDFS storage location periodically. You move individual partitions or a range of partitions from the "hot" storage location to the "cold" storage location using the same method:
=> SELECT SET_OBJECT_STORAGE_POLICY('messages', 'coldstorage', '201404', '201404'
   USING PARAMETERS ENFORCE_STORAGE_MOVE = 'true');
 SET_OBJECT_STORAGE_POLICY
----------------------------
 Object storage policy set.
(1 row)
=> SELECT projection_name, node_name, location_label
   FROM PARTITIONS WHERE PARTITION_KEY = '201404';
 projection_name |    node_name     | location_label
-----------------+------------------+----------------
 messages_b0     | v_vmart_node0002 | coldstorage
 messages_b0     | v_vmart_node0003 | coldstorage
 messages_b1     | v_vmart_node0003 | coldstorage
 messages_b0     | v_vmart_node0001 | coldstorage
 messages_b1     | v_vmart_node0002 | coldstorage
 messages_b1     | v_vmart_node0001 | coldstorage
(6 rows)

Moving Partitions to a Table Stored on HDFS


Another method of moving partitions from hot storage to cold storage is to move the partition's data to a separate table that is stored on HDFS. This method breaks the data into two tables, one containing hot data and the other containing cold data. Use this method if you want to prevent queries from inadvertently accessing data stored in the slower HDFS storage location. To query the older data, you must explicitly query the cold table.
To move partitions:


1. Create a new table whose schema matches that of the existing partitioned table.
2. Set the storage policy of the new table to use the HDFS-based storage location.
3. Use the MOVE_PARTITIONS_TO_TABLE function to move a range of partitions from the hot table to the cold table.

The following example demonstrates these steps. You first create a table named cold_messages. You then assign it the HDFS-based storage location named coldstorage and, finally, move a range of partitions.
=> CREATE TABLE cold_messages LIKE messages INCLUDING PROJECTIONS;
=> SELECT SET_OBJECT_STORAGE_POLICY('cold_messages', 'coldstorage');
=> SELECT MOVE_PARTITIONS_TO_TABLE('messages','201309','201403','cold_messages');

Note: The partitions moved using this method do not immediately migrate to the storage
location on HDFS. Instead, the Tuple Mover eventually moves them to the storage location.

Backing Up HP Vertica Storage Locations for HDFS


Note: The backup and restore features are available only in the Enterprise Edition product, not in HP Vertica for SQL on Hadoop.

HP recommends that you regularly back up the data in your HP Vertica database. This recommendation includes data stored in your HDFS storage locations. The HP Vertica backup script (vbr.py) can back up HDFS storage locations. However, you must perform several configuration steps before it can back up these locations.
Caution: After you have created an HDFS storage location, full database backups will fail with the error message:

ERROR 5127: Unable to create snapshot No such file /usr/bin/hadoop:
check the HadoopHome configuration parameter

This error occurs because the backup script cannot back up the HDFS storage locations. You must configure HP Vertica and Hadoop to enable the backup script to back up these locations. After you configure HP Vertica and Hadoop, you can once again perform full database backups.
There are several considerations for backing up HDFS storage locations in your database:


- The HDFS storage location backup feature relies on the snapshotting feature introduced in HDFS 2.0. You cannot back up an HDFS storage location stored on an earlier version of HDFS.
- HDFS storage locations do not support object-level backups. You must perform a full database backup in order to back up the data in your HDFS storage locations.
- Data in an HDFS storage location is backed up to HDFS. This backup guards against accidental deletion or corruption of data. It does not prevent data loss in the case of a catastrophic failure of the entire Hadoop cluster. To prevent data loss, you must have a backup and disaster recovery plan for your Hadoop cluster.
- Data stored on the Linux native filesystem is still backed up to the location you specify in the backup configuration file. It and the data in HDFS storage locations are handled separately by the vbr.py backup script.
- You must configure your HP Vertica cluster in order to restore database backups containing an HDFS storage location. See Configuring HP Vertica to Back Up HDFS Storage Locations for the configuration steps you must take.
- The HDFS directory for the storage location must have snapshotting enabled. You can either directly configure this yourself or enable the database administrator's Hadoop account to do it for you automatically. See Configuring Hadoop to Enable Backup of HDFS Storage for more information.

The topics in this section explain the configuration steps you must take to enable the backup of HDFS storage locations.

Configuring HP Vertica to Restore HDFS Storage Locations

Your HP Vertica cluster must be able to run the Hadoop distcp command to restore a backup of an HDFS storage location. The easiest way to enable your cluster to run this command is to install several Hadoop packages on each node. These packages must be from the same distribution and version of Hadoop that is running on your Hadoop cluster.
The steps you need to take depend on:

- The distribution and version of Hadoop running on the Hadoop cluster containing your HDFS storage location.
- The distribution of Linux running on your HP Vertica cluster.


Note: Installing the Hadoop packages necessary to run distcp does not turn your HP Vertica database into a Hadoop cluster. This process installs just enough of the Hadoop support files on your cluster to run the distcp command. There is no additional overhead placed on the HP Vertica cluster, aside from a small amount of additional disk space consumed by the Hadoop support files.

Configuration Overview
The steps for configuring your HP Vertica cluster to restore backups of HDFS storage locations are:
1. If necessary, install and configure a Java runtime on the hosts in the HP Vertica cluster.
2. Find the location of your Hadoop distribution's package repository.
3. Add the Hadoop distribution's package repository to the Linux package manager on all hosts in your cluster.
4. Install the necessary Hadoop packages on your HP Vertica hosts.
5. Set two configuration parameters in your HP Vertica database related to Java and Hadoop.
6. If your HDFS storage location uses Kerberos, set additional configuration parameters to allow HP Vertica user credentials to be proxied.
7. Confirm that the Hadoop distcp command runs on your HP Vertica hosts.
The following sections describe these steps in greater detail.

Installing a Java Runtime

Your HP Vertica cluster must have a Java Virtual Machine (JVM) installed to run the Hadoop distcp command. It already has a JVM installed if you have configured it to:

- Execute User-Defined Extensions developed in Java. See Developing User Defined Functions in Java for more information.
- Access Hadoop data using the HCatalog Connector. See Using the HCatalog Connector for more information.

If your HP Vertica database does have a JVM installed, you must verify that your Hadoop distribution supports it. See your Hadoop distribution's documentation to determine which JVMs it supports.


If the JVM installed on your HP Vertica cluster is not supported by your Hadoop distribution, you must uninstall it. Then you must install a JVM that is supported by both HP Vertica and your Hadoop distribution. See HP Vertica SDKs in Supported Platforms for a list of the JVMs compatible with HP Vertica.
If your HP Vertica cluster does not have a JVM (or its existing JVM is incompatible with your Hadoop distribution), follow the instructions in Installing the Java Runtime on Your HP Vertica Cluster.

Finding Your Hadoop Distribution's Package Repository


Many Hadoop distributions have their own installation system, such as Cloudera Manager or Hortonworks Ambari. However, they also support manual installation using native Linux packages such as RPM and .deb files. These package files are maintained in a repository. You can configure your HP Vertica hosts to access this repository to download and install Hadoop packages.
Consult your Hadoop distribution's documentation to find the location of its Linux package repository. This information is often located in the portion of the documentation covering manual installation techniques. For example:

- The Hortonworks Version 2.1 topic on Configuring the Remote Repositories.
- The "Steps to Install CDH 5 Manually" section of the Cloudera Version 5.1.0 topic Installing CDH 5.

Each Hadoop distribution maintains separate repositories for each of the major Linux package management systems. Find the specific repository for the Linux distribution running on your HP Vertica cluster. Be sure that the package repository you select matches the version of the Hadoop distribution installed on your Hadoop cluster.

Configuring HP Vertica Nodes to Access the Hadoop Distribution's Package Repository

Configure the nodes in your HP Vertica cluster so they can access your Hadoop distribution's package repository. Your Hadoop distribution's documentation should explain how to add the repositories to your Linux platform. If the documentation does not explain how to add the repository to your packaging system, refer to your Linux distribution's documentation.
The steps you need to take depend on the package management system your Linux platform uses. Usually, the process involves:


- Downloading a configuration file.
- Adding the configuration file to the package management system's configuration directory.
- For Debian-based Linux distributions, adding the Hadoop repository encryption key to the root account keyring.
- Updating the package management system's index to have it discover new packages.

The following example demonstrates adding the Hortonworks 2.1 package repository to an Ubuntu 12.04 host. The steps in this example are explained in the Hortonworks documentation.
$ wget http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/hdp.list \
    -O /etc/apt/sources.list.d/hdp.list
--2014-08-20 11:06:00-- http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/hdp.list
Connecting to 16.113.84.10:8080... connected.
Proxy request sent, awaiting response... 200 OK
Length: 161 [binary/octet-stream]
Saving to: `/etc/apt/sources.list.d/hdp.list'
100%[======================================>] 161         --.-K/s   in 0s
2014-08-20 11:06:00 (8.00 MB/s) - `/etc/apt/sources.list.d/hdp.list' saved [161/161]


$ gpg --keyserver pgp.mit.edu --recv-keys B9733A7A07513CAD
gpg: requesting key 07513CAD from hkp server pgp.mit.edu
gpg: /root/.gnupg/trustdb.gpg: trustdb created
gpg: key 07513CAD: public key "Jenkins (HDP Builds) <jenkin@hortonworks.com>" imported
gpg: Total number processed: 1
gpg:               imported: 1  (RSA: 1)
$ gpg -a --export 07513CAD | apt-key add -
OK
$ apt-get update
Hit http://us.archive.ubuntu.com precise Release.gpg
Hit http://extras.ubuntu.com precise Release.gpg
Get:1 http://security.ubuntu.com precise-security Release.gpg [198 B]
Hit http://us.archive.ubuntu.com precise-updates Release.gpg
Get:2 http://public-repo-1.hortonworks.com HDP-UTILS Release.gpg [836 B]
Get:3 http://public-repo-1.hortonworks.com HDP Release.gpg [836 B]
Hit http://us.archive.ubuntu.com precise-backports Release.gpg
Hit http://extras.ubuntu.com precise Release
Get:4 http://security.ubuntu.com precise-security Release [50.7 kB]
Get:5 http://public-repo-1.hortonworks.com HDP-UTILS Release [6,550 B]
Hit http://us.archive.ubuntu.com precise Release
Hit http://extras.ubuntu.com precise/main Sources
Get:6 http://public-repo-1.hortonworks.com HDP Release [6,502 B]
Hit http://us.archive.ubuntu.com precise-updates Release
Get:7 http://public-repo-1.hortonworks.com HDP-UTILS/main amd64 Packages [1,955 B]
Get:8 http://security.ubuntu.com precise-security/main Sources [108 kB]
Get:9 http://public-repo-1.hortonworks.com HDP-UTILS/main i386 Packages [762 B]


. . .
Reading package lists... Done

You must add the Hadoop repository to all hosts in your HP Vertica cluster.

Installing the Required Hadoop Packages


After configuring the repository, you are ready to install the Hadoop packages. The packages you need to install are:

- hadoop
- hadoop-hdfs
- hadoop-client

The names of the packages are usually the same across all Hadoop and Linux distributions. These packages often have additional dependencies. Always accept any additional packages that the Linux package manager asks to install.
To install these packages, use the package manager command for your Linux distribution. The package manager command you need to use depends on your Linux distribution:

- On Red Hat and CentOS, the package manager command is yum.
- On Debian and Ubuntu, the package manager command is apt-get.
- On SUSE, the package manager command is zypper.

Consult your Linux distribution's documentation for instructions on installing packages.
The following example demonstrates installing the required Hadoop packages from the Hortonworks 2.1 distribution on an Ubuntu 12.04 system.
# apt-get install hadoop hadoop-hdfs hadoop-client
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following extra packages will be installed:
bigtop-jsvc hadoop-mapreduce hadoop-yarn zookeeper
The following NEW packages will be installed:
bigtop-jsvc hadoop hadoop-client hadoop-hdfs hadoop-mapreduce hadoop-yarn
zookeeper
0 upgraded, 7 newly installed, 0 to remove and 90 not upgraded.
Need to get 86.6 MB of archives.
After this operation, 99.8 MB of additional disk space will be used.
Do you want to continue [Y/n]? Y


Get:1 http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/ HDP/main bigtop-jsvc amd64 1.0.10-1 [28.5 kB]
Get:2 http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/ HDP/main zookeeper all 3.4.5.2.1.3.0-563 [6,820 kB]
Get:3 http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/ HDP/main hadoop all 2.4.0.2.1.3.0-563 [21.5 MB]
Get:4 http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/ HDP/main hadoop-hdfs all 2.4.0.2.1.3.0-563 [16.0 MB]
Get:5 http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/ HDP/main hadoop-yarn all 2.4.0.2.1.3.0-563 [15.1 MB]
Get:6 http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/ HDP/main hadoop-mapreduce all 2.4.0.2.1.3.0-563 [27.2 MB]
Get:7 http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/ HDP/main hadoop-client all 2.4.0.2.1.3.0-563 [3,650 B]
Fetched 86.6 MB in 1min 2s (1,396 kB/s)
Selecting previously unselected package bigtop-jsvc.
(Reading database ... 197894 files and directories currently installed.)
Unpacking bigtop-jsvc (from .../bigtop-jsvc_1.0.10-1_amd64.deb) ...
Selecting previously unselected package zookeeper.
Unpacking zookeeper (from .../zookeeper_3.4.5.2.1.3.0-563_all.deb) ...
Selecting previously unselected package hadoop.
Unpacking hadoop (from .../hadoop_2.4.0.2.1.3.0-563_all.deb) ...
Selecting previously unselected package hadoop-hdfs.
Unpacking hadoop-hdfs (from .../hadoop-hdfs_2.4.0.2.1.3.0-563_all.deb) ...
Selecting previously unselected package hadoop-yarn.
Unpacking hadoop-yarn (from .../hadoop-yarn_2.4.0.2.1.3.0-563_all.deb) ...
Selecting previously unselected package hadoop-mapreduce.
Unpacking hadoop-mapreduce (from .../hadoop-mapreduce_2.4.0.2.1.3.0-563_all.deb) ...
Selecting previously unselected package hadoop-client.
Unpacking hadoop-client (from .../hadoop-client_2.4.0.2.1.3.0-563_all.deb) ...
Processing triggers for man-db ...
Setting up bigtop-jsvc (1.0.10-1) ...
Setting up zookeeper (3.4.5.2.1.3.0-563) ...
update-alternatives: using /etc/zookeeper/conf.dist to provide /etc/zookeeper/conf
(zookeeper-conf) in auto mode.
Setting up hadoop (2.4.0.2.1.3.0-563) ...
update-alternatives: using /etc/hadoop/conf.empty to provide /etc/hadoop/conf (hadoop-conf) in auto mode.
Setting up hadoop-hdfs (2.4.0.2.1.3.0-563) ...
Setting up hadoop-yarn (2.4.0.2.1.3.0-563) ...
Setting up hadoop-mapreduce (2.4.0.2.1.3.0-563) ...
Setting up hadoop-client (2.4.0.2.1.3.0-563) ...
Processing triggers for libc-bin ...
ldconfig deferred processing now taking place

Setting Configuration Parameters


You must set two configuration parameters to enable HP Vertica to restore HDFS data:

- JavaBinaryForUDx is the path to the Java executable. You may have already set this value to use Java UDxs or the HCatalog Connector. You can find the path for the default Java executable from the Bash command shell using the command:

  which java

- HadoopHome is the path where Hadoop is installed on the HP Vertica hosts. This is the directory that contains bin/hadoop (the bin directory containing the Hadoop executable file). The default value for this parameter is /usr. The default value is correct if your Hadoop executable is located at /usr/bin/hadoop.

The following example demonstrates setting and then reviewing the values of these parameters.
=> ALTER DATABASE mydb SET JavaBinaryForUDx = '/usr/bin/java';
=> SELECT get_config_parameter('JavaBinaryForUDx');
 get_config_parameter
----------------------
 /usr/bin/java
(1 row)

=> ALTER DATABASE mydb SET HadoopHome = '/usr';
=> SELECT get_config_parameter('HadoopHome');
 get_config_parameter
----------------------
 /usr
(1 row)

There are additional parameters that you can optionally set (a sketch of setting them appears after this list):

- HadoopFSReadRetryTimeout and HadoopFSWriteRetryTimeout specify how long to wait before failing. The default value for each is 180 seconds, the Hadoop default. If you are confident that your file system will fail more quickly, you can potentially improve performance by lowering these values.
- HadoopFSReplication is the number of replicas HDFS makes. By default the Hadoop client chooses this; HP Vertica uses the same value for all nodes. We recommend against changing this unless directed to.
- HadoopFSBlockSizeBytes is the block size to write to HDFS; larger files are divided into blocks of this size. The default is 64 MB.
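Like the required parameters above, these are ordinary configuration parameters, so you can set them with ALTER DATABASE. The values below are illustrative only, and mydb is a placeholder database name:

=> ALTER DATABASE mydb SET HadoopFSReadRetryTimeout = 60;
=> ALTER DATABASE mydb SET HadoopFSBlockSizeBytes = 134217728; -- 128 MB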

Setting Kerberos Parameters


If your HP Vertica nodes are co-located on HDFS nodes and you are using Kerberos, you must
change some Hadoop configuration parameters. These changes are needed in order for restoring
from backups to work. In yarn-site.xml on every HP Vertica node, set the following parameters:


Parameter                                                        Value
yarn.resourcemanager.proxy-user-privileges.enabled              true
yarn.resourcemanager.proxyusers.*.groups
yarn.resourcemanager.proxyusers.*.hosts
yarn.resourcemanager.proxyusers.*.users
yarn.timeline-service.http-authentication.proxyusers.*.groups
yarn.timeline-service.http-authentication.proxyusers.*.hosts
yarn.timeline-service.http-authentication.proxyusers.*.users

No changes are needed on HDFS nodes that are not also HP Vertica nodes.
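Each of these settings is an ordinary property entry in yarn-site.xml. As a format sketch only (the table above lists a value only for the first parameter), the first setting would look like this:

<property>
  <name>yarn.resourcemanager.proxy-user-privileges.enabled</name>
  <value>true</value>
</property>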

Confirming that distcp Runs


Once the packages are installed on all hosts in your cluster, your database should be able to run the
Hadoop distcp command. To test it:
1. Log into any host in your cluster as the database administrator.
2. At the Bash shell, enter the command:
$ hadoop distcp

3. The command should print a message similar to the following:


usage: distcp OPTIONS [source_path...] <target_path>
              OPTIONS
 -async                Should distcp execution be blocking
 -atomic               Commit all changes or none
 -bandwidth <arg>      Specify bandwidth per map in MB
 -delete               Delete from target, files missing in source
 -f <arg>              List of files that need to be copied
 -filelimit <arg>      (Deprecated!) Limit number of files copied to <= n
 -i                    Ignore failures during copy
 -log <arg>            Folder on DFS where distcp execution logs are saved
 -m <arg>              Max number of concurrent maps to use for copy
 -mapredSslConf <arg>  Configuration for ssl config file, to use with hftps://
 -overwrite            Choose to overwrite target files unconditionally, even if they exist.
 -p <arg>              preserve status (rbugpc)(replication, block-size, user, group, permission, checksum-type)
 -sizelimit <arg>      (Deprecated!) Limit number of files copied to <= n bytes
 -skipcrccheck         Whether to skip CRC checks between source and target paths.
 -strategy <arg>       Copy strategy to use. Default is dividing work based on file sizes
 -tmp <arg>            Intermediate work path to be used for atomic commit
 -update               Update target, copying only missing files or directories

4. Repeat these steps on the other hosts in your database to ensure all of the hosts can run
distcp.

Troubleshooting
If you cannot run the distcp command, try the following steps:

- If Bash cannot find the hadoop command, you may need to manually add Hadoop's bin directory to the system search path. An alternative is to create a symbolic link in an existing directory in the search path (such as /usr/bin) to the hadoop binary, as sketched after this list.
- Ensure the version of Java installed on your HP Vertica cluster is compatible with your Hadoop distribution.
- Review the Linux package installation tool's logs for errors. In some cases, packages may not be fully installed, or may not have been downloaded due to network issues.
- Ensure that the database administrator account has permission to execute the hadoop command. You may need to add the account to a specific group in order to allow it to run the necessary commands.
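A minimal sketch of the symbolic-link workaround follows; the source path is only an example of where a Hadoop distribution might place the binary, so substitute the actual location on your hosts:

$ sudo ln -s /usr/hdp/current/hadoop-client/bin/hadoop /usr/bin/hadoop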

Configuring Hadoop and HP Vertica to Enable Backup of HDFS Storage

The HP Vertica backup script uses HDFS's snapshotting feature to create a backup of HDFS storage locations. A directory must allow snapshotting before HDFS can take a snapshot. Only a Hadoop superuser can enable snapshotting on a directory. HP Vertica can enable snapshotting automatically if the database administrator is also a Hadoop superuser.
If HDFS is unsecured, the following instructions apply to the database administrator account, usually dbadmin. If HDFS uses Kerberos security, the following instructions apply to the principal stored in the HP Vertica keytab file, usually vertica. The instructions below use the term "database account" to refer to this user.
We recommend that you make the database administrator or principal a Hadoop superuser. If you are not able to do so, you must enable snapshotting on the directory before configuring it for use by HP Vertica.
The steps you need to take to make the HP Vertica database administrator account a superuser depend on the distribution of Hadoop you are using. Consult your Hadoop distribution's documentation for details. Instructions for two distributions are provided here.

Granting Superuser Status on Hortonworks 2.1


To make the database account a Hadoop superuser:
1. Log into your Hadoop cluster's Hortonworks Hue web user interface. If your Hortonworks cluster uses Ambari or you do not have a web-based user interface, see the Hortonworks documentation for information on granting privileges to users.
2. Click the User Admin icon.
3. In the Hue Users page, click the database account's username.
4. Click the Step 3: Advanced tab.
5. Select Superuser status.

Granting Superuser Status on Cloudera 5.1


Cloudera Hadoop treats Linux users that are members of the group named supergroup as superusers. Cloudera Manager does not automatically create this group. Cloudera also does not create a Linux user for each Hadoop user. To create a Linux account for the database account and add it to supergroup:
1. Log into your Hadoop cluster's NameNode as root.
2. Use the groupadd command to add a group named supergroup.
3. Cloudera does not automatically create a Linux user that corresponds to the database administrator's Hadoop account. If the Linux system does not have a user for your database account, you must create it. Use the adduser command to create this user.
4. Use the usermod command to add the database account to supergroup.


5. Verify that the database account is now a member of supergroup using the groups command.
6. Repeat steps 1 through 5 for any other NameNodes in your Hadoop cluster.
The following example demonstrates following these steps to grant the database administrator
superuser status.
# adduser dbadmin
# groupadd supergroup
# usermod -a -G supergroup dbadmin
# groups dbadmin
dbadmin : dbadmin supergroup

Consult the documentation for the Linux distribution installed on your Hadoop cluster for more information on managing users and groups.

Manually Enabling Snapshotting for a Directory


If you cannot grant superuser status to the database account, you can instead enable snapshotting
of each directory manually. Use the following command:
hdfs dfsadmin -allowSnapshot path

Issue this command for each directory on each node. Remember to do this each time you add a
new node to your HDFS cluster.
Nested snapshottable directories are not allowed, so you cannot enable snapshotting for a parent
directory to automatically enable it for child directories. You must enable it for each individual
directory.

Additional Requirements for Kerberos


If HDFS uses Kerberos, then in addition to granting the keytab principal access, you must set an HP Vertica configuration parameter. In HP Vertica, set the HadoopConfDir parameter to the location of the directory containing the core-site.xml, hdfs-site.xml, and yarn-site.xml configuration files:
=> ALTER DATABASE exampledb SET HadoopConfDir = '/hadoop';

All three configuration files must be present in this directory.
If your HP Vertica nodes are not co-located on HDFS nodes, then you must copy these files from an HDFS node to each HP Vertica node. Use the same path on every database node, because HadoopConfDir is a global value.
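A minimal sketch of that copy step, assuming the /hadoop directory from the example above and placeholder host names, run from an HDFS node that has the files:

$ for host in vertica01 vertica02 vertica03; do
      scp core-site.xml hdfs-site.xml yarn-site.xml "$host":/hadoop/
  done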


Testing the Database Account's Ability to Make HDFS Directories Snapshottable

After making the database account a Hadoop superuser, you should verify that the account can set directories snapshottable:
1. Log into the Hadoop cluster as the database account (dbadmin by default).
2. Determine a location in HDFS where the database administrator can create a directory. The /tmp directory is usually available. Create a test HDFS directory using the command:

   hdfs dfs -mkdir /path/testdir

3. Make the test directory snapshottable using the command:

   hdfs dfsadmin -allowSnapshot /path/testdir

The following example demonstrates creating an HDFS directory and making it snapshottable:

$ hdfs dfs -mkdir /tmp/snaptest
$ hdfs dfsadmin -allowSnapshot /tmp/snaptest
Allowing snaphot on /tmp/snaptest succeeded

Performing Backups Containing HDFS Storage Locations

After you configure Hadoop and HP Vertica, HDFS storage locations are automatically backed up when you perform a full database backup. If you already have a backup configuration file for a full database backup, you do not need to make any changes to it. You just run the vbr.py backup script as usual to perform the full database backup. See Creating Full and Incremental Backups in the Administrator's Guide for instructions on running the vbr.py backup script.
If you do not have a backup configuration file for a full database backup, you must create one to back up the data in your HDFS storage locations. See Creating vbr.py Configuration Files in the Administrator's Guide for more information.

Removing HDFS Storage Locations

The steps to remove an HDFS storage location are similar to those for standard storage locations:


1. Remove any existing data from the HDFS storage location.
2. Change the location's usage to TEMP.
3. Retire the location on each host that has the storage location defined by using RETIRE_LOCATION. You can use the enforce_storage_move parameter to make the change immediately, or wait for the Tuple Mover to perform its next moveout.
4. Drop the location on each host that has the storage location defined by using DROP_LOCATION.
5. Optionally, remove the snapshots and files from the HDFS directory for the storage location.

The following sections explain each of these steps in detail.
Important: If you have backed up the data in the HDFS storage location you are removing, you must perform a full database backup after you remove the location. If you do not, and you restore the database to a backup made before you removed the location, the location's data is restored.

Removing Existing Data from an HDFS Storage Location

You cannot drop a storage location that contains data or is used by any storage policy. You have several options to remove data and storage policies:

- Drop all of the objects (tables or schemata) that store data in the location. This is the simplest option. However, you can only use this method if you no longer need the data stored in the HDFS storage location.
- Change the storage policies of objects stored on HDFS to another storage location. When you alter the storage policy, you force all of the data in the HDFS location to move to the new location. This option requires that you have an alternate storage location available.
- Clear the storage policies of all objects that store data on the storage location. You then move the location's data through a process of retiring it.

The following sections explain the last two options in greater detail.


Moving Data to Another Storage Location


You can move data off of an HDFS storage location by altering the storage policies of the objects
that use the location. Use the SET_OBJECT_STORAGE_POLICY function to change each
object's storage location. If you set this function's third argument to true, it moves the data off of the
storage location before returning.
The following example demonstrates moving the table named test from the hdfs2 storage location
to another location named ssd.
=> SELECT node_name, projection_name, location_label, total_row_count
   FROM V_MONITOR.STORAGE_CONTAINERS WHERE projection_name ILIKE 'test%';
    node_name     | projection_name | location_label | total_row_count
------------------+-----------------+----------------+-----------------
 v_vmart_node0001 | test_b1         | hdfs2          |          333631
 v_vmart_node0001 | test_b0         | hdfs2          |          332233
 v_vmart_node0001 | test_b0         | hdfs2          |          332233
 v_vmart_node0001 | test_b1         | hdfs2          |          333631
 v_vmart_node0003 | test_b1         | hdfs2          |          334136
 v_vmart_node0003 | test_b0         | hdfs2          |          333631
 v_vmart_node0003 | test_b0         | hdfs2          |          333631
 v_vmart_node0003 | test_b1         | hdfs2          |          334136
 v_vmart_node0002 | test_b1         | hdfs2          |          332233
 v_vmart_node0002 | test_b0         | hdfs2          |          334136
 v_vmart_node0002 | test_b0         | hdfs2          |          334136
 v_vmart_node0002 | test_b1         | hdfs2          |          332233
(12 rows)
=> select set_object_storage_policy('test','ssd', true);
             set_object_storage_policy
----------------------------------------------------
 Object storage policy set.
Task: moving storages
(Table: public.test) (Projection: public.test_b0)
(Table: public.test) (Projection: public.test_b1)
(1 row)
=> SELECT node_name, projection_name, location_label, total_row_count
   FROM V_MONITOR.STORAGE_CONTAINERS WHERE projection_name ILIKE 'test%';
    node_name     | projection_name | location_label | total_row_count
------------------+-----------------+----------------+-----------------
 v_vmart_node0001 | test_b0         | ssd            |          332233
 v_vmart_node0001 | test_b0         | ssd            |          332233
 v_vmart_node0001 | test_b1         | ssd            |          333631
 v_vmart_node0001 | test_b1         | ssd            |          333631
 v_vmart_node0002 | test_b0         | ssd            |          334136
 v_vmart_node0002 | test_b0         | ssd            |          334136
 v_vmart_node0002 | test_b1         | ssd            |          332233
 v_vmart_node0002 | test_b1         | ssd            |          332233
 v_vmart_node0003 | test_b0         | ssd            |          333631
 v_vmart_node0003 | test_b0         | ssd            |          333631
 v_vmart_node0003 | test_b1         | ssd            |          334136
 v_vmart_node0003 | test_b1         | ssd            |          334136
(12 rows)

Once you have moved all of the data in the storage location, you are ready to proceed to the next
step of removing the storage location.
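Before you drop the location, you can confirm that no storage containers remain on it. The
following query is a minimal sketch rather than part of the original example; it assumes the
location still uses the label hdfs2:

=> SELECT COUNT(*) FROM V_MONITOR.STORAGE_CONTAINERS
   WHERE location_label = 'hdfs2';

A result of 0 indicates that no data remains on the location.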

Clearing Storage Policies


Another option for moving data off of a storage location is to clear the storage policy of each object
that stores data in the location. You clear an object's storage policy using the
CLEAR_OBJECT_STORAGE_POLICY function. Once you clear the storage policy, the Tuple Mover
eventually migrates the object's data from the storage location to the database's default storage
location. The Tuple Mover moves the data when it performs a move storage operation. This
operation runs infrequently and at low priority, so it may be some time before the data migrates
out of the storage location. You can speed up the data migration process by:
1. Calling the RETIRE_LOCATION function to retire the storage location on each host that
   defines it.
2. Calling the MOVE_RETIRED_LOCATION_DATA function to move the location's data to the
   database's default storage location.
3. Calling the RESTORE_LOCATION function to restore the location on each host that defines it.
   You must perform this step because you cannot drop retired storage locations.
The following example demonstrates clearing the object storage policy of a table stored on HDFS,
then performing the steps to move the data off of the location.
=> SELECT * FROM storage_policies;
 schema_name | object_name | policy_details | location_label
-------------+-------------+----------------+----------------
 public      | test        | Table          | hdfs2
(1 row)
=> SELECT clear_object_storage_policy('test');
  clear_object_storage_policy
--------------------------------
 Object storage policy cleared.
(1 row)
=> SELECT retire_location('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001',
   'v_vmart_node0001');
                        retire_location
---------------------------------------------------------------
 webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001 retired.
(1 row)


=> SELECT retire_location('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002',
   'v_vmart_node0002');
                        retire_location
---------------------------------------------------------------
 webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002 retired.
(1 row)
=> SELECT retire_location('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003',
   'v_vmart_node0003');
                        retire_location
---------------------------------------------------------------
 webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003 retired.
(1 row)
=> SELECT node_name, projection_name, location_label, total_row_count
   FROM V_MONITOR.STORAGE_CONTAINERS WHERE projection_name ILIKE 'test%';
    node_name     | projection_name | location_label | total_row_count
------------------+-----------------+----------------+-----------------
 v_vmart_node0001 | test_b1         | hdfs2          |          333631
 v_vmart_node0001 | test_b0         | hdfs2          |          332233
 v_vmart_node0002 | test_b1         | hdfs2          |          332233
 v_vmart_node0002 | test_b0         | hdfs2          |          334136
 v_vmart_node0003 | test_b1         | hdfs2          |          334136
 v_vmart_node0003 | test_b0         | hdfs2          |          333631
(6 rows)
=> SELECT move_retired_location_data();
          move_retired_location_data
-----------------------------------------------
 Move data off retired storage locations done
(1 row)
=> SELECT node_name, projection_name, location_label, total_row_count
   FROM V_MONITOR.STORAGE_CONTAINERS WHERE projection_name ILIKE 'test%';
    node_name     | projection_name | location_label | total_row_count
------------------+-----------------+----------------+-----------------
 v_vmart_node0001 | test_b0         |                |          332233
 v_vmart_node0001 | test_b1         |                |          333631
 v_vmart_node0002 | test_b0         |                |          334136
 v_vmart_node0002 | test_b1         |                |          332233
 v_vmart_node0003 | test_b0         |                |          333631
 v_vmart_node0003 | test_b1         |                |          334136
(6 rows)
=> SELECT restore_location('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001',
   'v_vmart_node0001');
                        restore_location
----------------------------------------------------------------
 webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001 restored.
(1 row)
=> SELECT restore_location('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002',
   'v_vmart_node0002');
                        restore_location
----------------------------------------------------------------
 webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002 restored.
(1 row)
=> SELECT restore_location('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003',
   'v_vmart_node0003');
                        restore_location
----------------------------------------------------------------
 webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003 restored.
(1 row)

Changing the Usage of HDFS Storage Locations


You cannot drop a storage location that allows the storage of data files (ROS containers). Before
you can drop an HDFS storage location, you must change its usage from DATA to TEMP using the
ALTER_LOCATION_USE function. Make this change on every host in the cluster that defines the
storage location.
Important: HP recommends that you do not use HDFS storage locations for temporary file
storage. Only set HDFS storage locations to allow temporary file storage as part of the removal
process.
The following example demonstrates using the ALTER_LOCATION_USE function to change the
HDFS storage location to temporary file storage. The example calls the function three times: once
for each node in the cluster that defines the location.
=> SELECT ALTER_LOCATION_USE('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001',
   'v_vmart_node0001','TEMP');
                          ALTER_LOCATION_USE
---------------------------------------------------------------------
 webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001 usage changed.
(1 row)
=> SELECT ALTER_LOCATION_USE('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002',
   'v_vmart_node0002','TEMP');
                          ALTER_LOCATION_USE
---------------------------------------------------------------------
 webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002 usage changed.
(1 row)
=> SELECT ALTER_LOCATION_USE('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003',
   'v_vmart_node0003','TEMP');
                          ALTER_LOCATION_USE
---------------------------------------------------------------------
 webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003 usage changed.
(1 row)


Dropping an HDFS Storage Location


After removing all data and changing the data usage of an HDFS storage location, you can drop it.
Use the DROP_LOCATION function to drop the storage location from each host that defines it.
The following example demonstrates dropping an HDFS storage location from a three-node HP
Vertica database.
=> SELECT DROP_LOCATION('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001',
   'v_vmart_node0001');
                         DROP_LOCATION
---------------------------------------------------------------
 webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001 dropped.
(1 row)
=> SELECT DROP_LOCATION('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002',
   'v_vmart_node0002');
                         DROP_LOCATION
---------------------------------------------------------------
 webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002 dropped.
(1 row)
=> SELECT DROP_LOCATION('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003',
   'v_vmart_node0003');
                         DROP_LOCATION
---------------------------------------------------------------
 webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003 dropped.
(1 row)

Removing Storage Location Files from HDFS


Dropping an HDFS storage location does not automatically clean the HDFS directory that stored
the location's files. Any snapshots of the data files created when backing up the location are also
not deleted. These files consume disk space on HDFS and also prevent the directory from being
reused as an HDFS storage location. HP Vertica refuses to create a storage location in a directory
that contains existing files or subdirectories. To delete the files, log into the Hadoop cluster and
remove them from HDFS, or use another HDFS file management tool.

Removing Backup Snapshots


HDFS returns an error if you attempt to remove a directory that has snapshots:
$ hdfs dfs -rm -r -f -skipTrash /user/dbadmin/v_vmart_node0001
rm: The directory /user/dbadmin/v_vmart_node0001 cannot be deleted since
/user/dbadmin/v_vmart_node0001 is snapshottable and already has snapshots


The HP Vertica backup script creates snapshots of HDFS storage locations as part of the backup
process. See Backing Up HDFS Storage Locations for more information. If you made backups of
your HDFS storage location, you must delete the snapshots before removing the directories.
HDFS stores snapshots in a subdirectory named .snapshot. You can list the snapshots in this
directory using the standard HDFS ls command. The following example demonstrates listing the
snapshots defined for node0001.
$ hdfs dfs -ls /user/dbadmin/v_vmart_node0001/.snapshot
Found 1 items
drwxrwx--- dbadmin supergroup  0 2014-09-02 10:13 /user/dbadmin/v_vmart_node0001/.snapshot/s20140902-101358.629

To remove a snapshot, use the following command:

hdfs dfs -deleteSnapshot <directory> <snapshot name>

The following example demonstrates the command to delete the snapshot shown in the previous
example:
$ hdfs dfs -deleteSnapshot /user/dbadmin/v_vmart_node0001 s20140902-101358.629

You must delete each snapshot from the directory for each host in the cluster. Once you have
deleted the snapshots, you can delete the directories in the storage location.
Important: Each snapshot's name is based on a timestamp down to the millisecond. Each node
creates its own snapshot independently. Because the nodes do not synchronize snapshot
creation, their snapshot names differ. You must list each node's snapshot directory to learn the
names of the snapshots it contains.
See Apache's HDFS Snapshot documentation for more information about managing and removing
snapshots.
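Because each node's directory can contain its own snapshots, you typically repeat the
list-and-delete steps once per node. The following shell loop is a minimal sketch rather than part
of the original example; it assumes the storage location directories follow the
/user/dbadmin/v_vmart_node000N naming used above:

# List and delete every snapshot under each node's storage location directory.
for dir in /user/dbadmin/v_vmart_node0001 /user/dbadmin/v_vmart_node0002 /user/dbadmin/v_vmart_node0003
do
    # Show the snapshots so you can confirm their names before deleting them.
    hdfs dfs -ls "$dir/.snapshot"
    # Extract each snapshot name from the listing and delete it from its parent directory.
    for snap in $(hdfs dfs -ls "$dir/.snapshot" | grep '^d' | awk -F/ '{print $NF}')
    do
        hdfs dfs -deleteSnapshot "$dir" "$snap"
    done
done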

Removing the Storage Location Directories


You can remove the directories that held the storage location's data by either of the following
methods:

•  Use an HDFS file manager to delete the directories. See your Hadoop distribution's
   documentation to determine if it provides a file manager.

•  Log into the Hadoop NameNode using the database administrator's account and use HDFS's
   rmr command to delete the directories. See Apache's File System Shell Guide for more
   information.


The following example uses the HDFS rmr command from the Linux command line to delete the
directories left behind in the HDFS storage location directory /user/dbadmin. It uses the -skipTrash
flag to force the immediate deletion of the files.
$ hdfs dfs -ls /user/dbadmin
Found 3 items
drwxrwx--- dbadmin supergroup  0 2014-08-29 15:11 /user/dbadmin/v_vmart_node0001
drwxrwx--- dbadmin supergroup  0 2014-08-29 15:11 /user/dbadmin/v_vmart_node0002
drwxrwx--- dbadmin supergroup  0 2014-08-29 15:11 /user/dbadmin/v_vmart_node0003

$ hdfs dfs -rmr -skipTrash /user/dbadmin/*
Deleted /user/dbadmin/v_vmart_node0001
Deleted /user/dbadmin/v_vmart_node0002
Deleted /user/dbadmin/v_vmart_node0003

Troubleshooting HDFS Storage Locations


This topic explains some common issues with HDFS storage locations.

HDFS Storage Disk Consumption


By default, HDFS makes three copies of each file it stores. This replication helps prevent data loss
due to disk or system failure. It also helps increase performance by allowing several nodes to
handle requests for a file.
An HP Vertica database with a K-Safety value of 1 or greater also stores its data redundantly using
buddy projections.
When a K-Safe HP Vertica database stores data in an HDFS storage location, its data redundancy
is compounded by HDFS's redundancy: HDFS stores three copies of the primary projection's data,
plus three copies of the buddy projection, for a total of six copies of the data.
If you want to reduce the amount of disk storage used by HDFS locations, you can alter the
number of copies of data that HDFS stores. The HP Vertica configuration parameter named
HadoopFSReplication controls the number of copies of data HDFS stores.
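Before changing the parameter, you can check its current value. The following query is a minimal
sketch rather than part of the original example; it assumes the parameter is visible under this name
in the CONFIGURATION_PARAMETERS system table:

=> SELECT parameter_name, current_value, default_value
   FROM CONFIGURATION_PARAMETERS
   WHERE parameter_name = 'HadoopFSReplication';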
You can determine the current HDFS disk usage by logging into the Hadoop NameNode and
issuing the command:
hdfs dfsadmin -report

This command prints the usage for the entire HDFS file system, followed by details for each node
in the Hadoop cluster. The following example shows the beginning of the output from this
command; note the DFS Used value:


$ hdfs dfsadmin -report


Configured Capacity: 51495516981 (47.96 GB)
Present Capacity: 32087212032 (29.88 GB)
DFS Remaining: 31565144064 (29.40 GB)
DFS Used: 522067968 (497.88 MB)
DFS Used%: 1.63%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
. . .

After loading a simple one-million-row table into an HDFS storage location, the report shows
greater disk usage:
Configured Capacity: 51495516981 (47.96 GB)
Present Capacity: 32085299338 (29.88 GB)
DFS Remaining: 31373565952 (29.22 GB)
DFS Used: 711733386 (678.76 MB)
DFS Used%: 2.22%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
. . .

The following HP Vertica example demonstrates:


1. Dropping the table in HP Vertica.
2. Setting the HadoopFSReplication configuration option to 1. This tells HDFS to store a single
   copy of an HDFS storage location's data.
3. Recreating the table and reloading its data.
=> DROP TABLE messages;
DROP TABLE
=> ALTER DATABASE mydb SET HadoopFSReplication = 1;
=> CREATE TABLE messages (id INTEGER, text VARCHAR);
CREATE TABLE
=> SELECT SET_OBJECT_STORAGE_POLICY('messages', 'hdfs');
 SET_OBJECT_STORAGE_POLICY
----------------------------
 Object storage policy set.
(1 row)
=> COPY messages FROM '/home/dbadmin/messages.txt' DIRECT;
 Rows Loaded
-------------
     1000000
Running the HDFS report on Hadoop now shows less disk space use:


$ hdfs dfsadmin -report


Configured Capacity: 51495516981 (47.96 GB)
Present Capacity: 32086278190 (29.88 GB)
DFS Remaining: 31500988416 (29.34 GB)
DFS Used: 585289774 (558.18 MB)
DFS Used%: 1.82%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
. . .

Caution: Reducing the number of copies of data stored by HDFS increases the risk of data
loss. It can also degrade the performance of HDFS by reducing the number of nodes that can
provide access to a file, which in turn can slow HP Vertica queries that involve data stored in an
HDFS storage location.

Kerberos Authentication When Creating a Storage Location


If HDFS uses Kerberos authentication, then the CREATE LOCATION statement authenticates
using the HP Vertica keytab principal, not the principal of the user performing the action. If the
creation fails with an authentication error, verify that you have followed the steps described in
Configuring Kerberos to configure this principal.
When creating an HDFS storage location on a Hadoop cluster that uses Kerberos, CREATE
LOCATION reports the principal being used, as in the following example:
=> CREATE LOCATION 'webhdfs://hadoop.example.com:50070/user/dbadmin' ALL NODES SHARED
USAGE 'data' LABEL 'coldstorage';
NOTICE 0: Performing HDFS operations using kerberos principal
[vertica/hadoop.example.com]
CREATE LOCATION
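
If the NOTICE reports an unexpected principal, you can review the Kerberos-related configuration
parameters the database is using. The following query is a minimal sketch rather than part of the
original example; it assumes the parameters appear in the CONFIGURATION_PARAMETERS
system table under these names:

=> SELECT parameter_name, current_value
   FROM CONFIGURATION_PARAMETERS
   WHERE parameter_name IN ('KerberosKeytabFile', 'KerberosServiceName');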

Backup or Restore Fails When Using Kerberos


When backing up an HDFS storage location that uses Kerberos, you might see an error such as:
createSnapshot: Failed on local exception: java.io.IOException:
java.lang.IllegalArgumentException: Server has invalid Kerberos principal:
hdfs/test.example.com@EXAMPLE.COM;

When restoring an HDFS storage location that uses Kerberos, you might see an error such as:


Error msg: Initialization thread logged exception:
Distcp failure!

Either of these failures means that HP Vertica could not find the required configuration files in the
HadoopConfDir directory. Usually this happens because you have set the parameter but have not
copied the files from an HDFS node to your HP Vertica node. See "Additional Requirements for
Kerberos" in Configuring Hadoop and HP Vertica to Enable Backup of HDFS Storage.
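To confirm that the files are in place, check the HadoopConfDir directory on each HP Vertica
node. This is a minimal sketch rather than part of the original document; it assumes the parameter
points to /etc/hadoop/conf, so substitute the value you actually configured:

$ # Run on every HP Vertica node; backup and restore expect the Hadoop client
$ # configuration files (such as core-site.xml and hdfs-site.xml) to be readable here.
$ ls -l /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml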


Integrating HP Vertica with the MapR Distribution of Hadoop


MapR is a distribution of Apache Hadoop produced by MapR Technologies that extends the
standard Hadoop components with its own features. By adding HP Vertica to a MapR cluster, you
can benefit from the advantages of both HP Vertica and Hadoop. To learn more about integrating
HP Vertica and MapR, see Configuring HP Vertica Analytics Platform with MapR, which appears
on the MapR website.


Using Kerberos with Hadoop


If your Hadoop cluster uses Kerberos authentication to restrict access to HDFS, you must
configure HP Vertica to make authenticated connections. The details of this configuration vary
based on which methods you use to access HDFS data:

•  How Vertica uses Kerberos With Hadoop

•  Configuring Kerberos
How Vertica uses Kerberos With Hadoop


HP Vertica authenticates with Hadoop in two ways that require different configurations:

•  User Authentication: On behalf of the user, by passing along the user's existing Kerberos
   credentials, as occurs with the HDFS Connector and the HCatalog Connector.

•  HP Vertica Authentication: On behalf of system processes (such as the Tuple Mover), by
   using a special Kerberos credential stored in a keytab file.

User Authentication
To use HP Vertica with Kerberos and Hadoop, the client user first authenticates with the Kerberos
server (Key Distribution Center, or KDC) being used by the Hadoop cluster. A user might run kinit
or sign in to Active Directory, for example. A user who authenticates to a Kerberos server receives
a Kerberos ticket. At the beginning of a client session, HP Vertica automatically retrieves this
ticket. The database then uses this ticket to get a Hadoop token, which Hadoop uses to grant
access. HP Vertica uses this token to access HDFS, such as when executing a query on behalf of
the user. When the token expires, the database automatically renews it, also renewing the
Kerberos ticket if necessary.
The following figure shows how the user, HP Vertica, Hadoop, and Kerberos interact in user
authentication:


When using the HDFS Connector or the HCatalog Connector, or when reading an ORC file stored
in HDFS, HP Vertica uses the client identity as the preceding figure shows.

HP Vertica Authentication
Automatic processes, such as the Tuple Mover, do not log in the way users do. Instead, HP Vertica
uses a special identity (principal) stored in a keytab file on every database node. (This approach is
also used for HP Vertica clusters that use Kerberos but do not use Hadoop.) After you configure
the keytab file, HP Vertica uses the principal residing there to automatically obtain and maintain a
Kerberos ticket, much as in the client scenario. In this case, the client does not interact with
Kerberos.
The following figure shows the interactions required for HP Vertica authentication:


Each HP Vertica node uses its own principal; it is common to incorporate the name of the node into
the principal name. You can either create one keytab per node, containing only that node's principal,
or you can create a single keytab containing all the principals and distribute the file to all nodes.
Either way, the node uses its principal to get a Kerberos ticket and then uses that ticket to get a
Hadoop token. For simplicity, the preceding figure shows the full set of interactions for only one
database node.
When creating HDFS storage locations, HP Vertica uses the principal in the keytab file, not the
principal of the user issuing the CREATE LOCATION statement.

See Also
For specific configuration instructions, see Configuring Kerberos.


Configuring Kerberos
HP Vertica can connect with Hadoop in several ways, and how you manage Kerberos
authentication varies by connection type. This documentation assumes that you are using
Kerberos for both your HDFS and HP Vertica clusters.

Prerequisite: Setting Up Users and the Keytab File


If you have not already configured Kerberos authentication for HP Vertica, follow the instructions in
Configure HP Vertica for Kerberos Authentication. In particular:

•  Create one Kerberos principal per node.

•  Place the keytab file(s) in the same location on each database node and set the location in
   KerberosKeytabFile (see Specify the Location of the Keytab File).

•  Set KerberosServiceName to the name of the principal (see Inform HP Vertica About the
   Kerberos Principal). A sketch of setting these parameters follows this list.
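The following statements are a minimal sketch of setting these two parameters, not taken from the
original instructions; the database name, keytab path, and service name are placeholders, and the
ALTER DATABASE syntax mirrors the parameter-setting examples used elsewhere in this guide:

=> -- Placeholder values; substitute your own database name, keytab path, and principal name.
=> ALTER DATABASE exampledb SET KerberosKeytabFile = '/etc/vertica.keytab';
=> ALTER DATABASE exampledb SET KerberosServiceName = 'vertica';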

HCatalog Connector
You use the HCatalog Connector to query data in Hive. Queries are executed on behalf of HP
Vertica users. If the current user has a Kerberos key, then HP Vertica passes it to the HCatalog
Connector automatically. Verify that all users who need access to Hive have been granted access
to HDFS.
In addition, in your Hadoop configuration files (core-site.xml in most distributions), make sure that
you enable all Hadoop components to impersonate the HP Vertica user. The easiest way to do this
is to set the proxyuser property using wildcards for all users on all hosts and in all groups. Consult
your Hadoop documentation for instructions. Make sure you do this before running hcatUtil (see
Configuring HP Vertica for HCatalog).

HDFS Connector
The HDFS Connector loads data from HDFS into HP Vertica on behalf of the user, using a User
Defined Source. If the user performing the data load has a Kerberos key, then the UDS uses it to
access HDFS. Verify that all users who use this connector have been granted access to HDFS.
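One quick way to verify a user's access is to authenticate as that user and list the HDFS path that
the load reads from. This is a minimal sketch rather than part of the original document; the
principal and path are placeholders:

$ # Authenticate as the user who will run the load (placeholder principal).
$ kinit exampleuser@EXAMPLE.COM
$ # If this listing succeeds, the user's Kerberos credentials grant access to the source path.
$ hdfs dfs -ls /path/to/source/data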


HDFS Storage Location


You can create a database storage location in HDFS. An HDFS storage location provides
improved performance compared to other HDFS interfaces (such as the HCatalog Connector).
After you create Kerberos principals for each node, give all of them read and write permissions to
the HDFS directory you will use as a storage location. If you plan to back up HDFS storage
locations, take the following additional steps:

•  Grant Hadoop superuser privileges to the new principals.

•  Configure backups, including setting the HadoopConfigDir configuration parameter, following
   the instructions in Configuring Hadoop and HP Vertica to Enable Backup of HDFS Storage.

•  Configure user impersonation to be able to restore from backups, following the instructions in
   "Setting Kerberos Parameters" in Configuring HP Vertica to Restore HDFS Storage Locations.

Because the keytab file supplies the principal used to create the location, you must have it in place
before creating the storage location. After you deploy keytab files to all database nodes, use the
CREATE LOCATION statement to create the storage location as usual.

Token Expiration
HP Vertica attempts to automatically refresh Hadoop tokens before they expire, but you can also
set a minimum refresh frequency if you prefer. The HadoopFSTokenRefreshFrequency
configuration parameter specifies the frequency in seconds:
=> ALTER DATABASE exampledb SET HadoopFSTokenRefreshFrequency = '86400';

If the current age of the token is greater than the value specified in this parameter, HP Vertica
refreshes the token before accessing data stored in HDFS.

See Also

•  How Vertica uses Kerberos With Hadoop

•  Troubleshooting Kerberos Authentication


We appreciate your feedback!


If you have comments about this document, you can contact the documentation team by email. If
an email client is configured on this system, click the link above and an email window opens with
the following information in the subject line:
Feedback on Hadoop Integration Guide (Vertica Analytic Database 7.1.x)
Just add your feedback to the email and click send.
If no email client is available, copy the information above to a new message in a web mail client,
and send your feedback to vertica-docfeedback@hp.com.

