Legal Notices
Warranty
The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be
construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.
The information contained herein is subject to change without notice.
Copyright Notice
Copyright 2006 - 2015 Hewlett-Packard Development Company, L.P.
Trademark Notices
Adobe is a trademark of Adobe Systems Incorporated.
Microsoft and Windows are U.S. registered trademarks of Microsoft Corporation.
UNIX is a registered trademark of The Open Group.
Contents

Cluster Layout
Prerequisites
Selecting VerticaInputFormat
Passing Parameters to the HP Vertica Connector for Hadoop Map Reduce At Run Time
Using Hadoop Streaming with the HP Vertica Connector for Hadoop Map Reduce
Syntax
Kerberos
Query Performance
Examples
HP Vertica Requirements
Hadoop Requirements
Testing Connectivity
Connection Errors
SerDe Errors
HDFS Space Requirements
Configuration Overview
Troubleshooting
User Authentication
HP Vertica Authentication
See Also
Configuring Kerberos
HCatalog Connector
HDFS Connector
Token Expiration
The Hadoop Distributed Filesystem (HDFS): A fault-tolerant data storage system that takes
network structure into account when storing data.
Hive: A data warehouse that provides the ability to query data stored in Hadoop.
Some HDFS configurations use Kerberos authentication. HP Vertica integrates with Kerberos to
access HDFS data if needed. See Using Kerberos with Hadoop.
While HP Vertica and Apache Hadoop can work together, HP Vertica contains many of the same
data processing features as Hadoop. For example, using HP Vertica's flex tables, you can
manipulate semi-structured data, a task commonly associated with Hadoop. You can also
create user-defined extensions (UDxs) using HP Vertica's SDK that perform tasks similar to
Hadoop's MapReduce jobs.
The HP Vertica Connector for Apache Hadoop MapReduce lets you create Hadoop MapReduce
jobs that retrieve data from HP Vertica. These jobs can also insert data into HP Vertica.
The HP Vertica Connector for HDFS lets HP Vertica access data stored in HDFS.
The HP Vertica Connector for HCatalog lets HP Vertica query data stored in a Hive database
the same way you query data stored natively in an HP Vertica schema.
The HP Vertica Storage Location for HDFS lets HP Vertica store data on HDFS. If you are using
the HP Vertica for SQL on Hadoop product, this is how all of your data is stored.
Cluster Layout
In the Enterprise Edition product, your HP Vertica and Hadoop clusters must be set up on separate
nodes, ideally connected by a high-bandwidth network connection. This is different from the
configuration for HPVertica for SQLon Hadoop, in which HP Vertica nodes are co-located on
Hadoop nodes.
The following figure illustrates the Enterprise Edition configuration:
The network is a key performance component of any well-configured cluster. When HP Vertica
stores data to HDFS it writes and reads data across the network.
The layout shown in the figure calls for two networks, and there are benefits to adding a third:

• Database Private Network: HP Vertica uses a private network for command and control and
  moving data between nodes in support of its database functions. In some networks, the
  command and control and passing of data are split across two networks.
• Database/Hadoop Shared Network: Each HP Vertica node must be able to connect to each
  Hadoop data node and the Name Node. Hadoop best practices generally require a dedicated
  network for the Hadoop cluster. This is not a technical requirement, but a dedicated network
  improves Hadoop performance. HP Vertica and Hadoop should share the dedicated Hadoop
  network.
• Optional Client Network: Outside clients may access the clustered networks through a client
  network. This is not an absolute requirement, but the use of a third network that supports client
  connections to either HP Vertica or Hadoop can improve performance. If the configuration does
  not support a client network, then client connections should use the shared network.
Hadoop MapReduce
If you are using the HP Vertica for SQL on Hadoop product, which is licensed for HDFS data only,
there are some additional considerations. See Choosing How to Connect HP Vertica To HDFS.
• You need to incorporate data from HP Vertica into your MapReduce job. For example, suppose
  you are using Hadoop's MapReduce to process web server logs. You may want to access
  sentiment analysis data stored in HP Vertica using Pulse to try to correlate a website visitor with
  social media activity.
• You are using Hadoop MapReduce to refine data on which you want to perform analytics. You
  can have your MapReduce job directly insert data into HP Vertica, where you can analyze it in
  real time using all of HP Vertica's features.
The HP Vertica Connector for Hadoop Map Reduce:

• lets Hadoop store its results in HP Vertica. The Connector can create a table for the Hadoop
  data if it does not already exist.
• lets applications written in Apache Pig access and store data in HP Vertica.
The Connector runs on each node in the Hadoop cluster, so the Hadoop nodes and HP Vertica
nodes communicate with each other directly. Direct connections allow data to be transferred in
parallel, dramatically increasing processing speed.
The Connector is written in Java, and is compatible with all platforms supported by Hadoop.
Note: To prevent Hadoop from potentially inserting multiple copies of data into HP Vertica, the
HP Vertica Connector for Hadoop Map Reduce disables Hadoop's speculative execution
feature.
Prerequisites
Before you can use the HP Vertica Connector for Hadoop MapReduce, you must install and
configure Hadoop and be familiar with developing Hadoop applications. For details on installing and
using Hadoop, please see the Apache Hadoop Web site.
See HP Vertica 7.1.x Supported Platforms for a list of the versions of Hadoop and Pig that the
connector supports.
You will also need a copy of the HP Vertica JDBC driver, which you can also download from the
myVertica portal.
You need to perform the following steps on each node in your Hadoop cluster:
1. Copy the HP Vertica Connector for Hadoop Map Reduce .zip archive you downloaded to a
temporary location on the Hadoop node.
2. Copy the HP Vertica JDBC driver .jar file to the same location on your node. If you haven't
already, you can download this driver from the myVertica portal.
3. Unzip the connector .zip archive into a temporary directory. On Linux, you usually use the
command unzip.
4. Locate the Hadoop home directory (the directory where Hadoop is installed). The location of
this directory depends on how you installed Hadoop (manual install versus a package supplied
by your Linux distribution or Cloudera). If you do not know the location of this directory, you can
try the following steps:
• See if the HADOOP_HOME environment variable is set by issuing the command echo
  $HADOOP_HOME on the command line.
• See if Hadoop is in your path by typing hadoop classpath on the command line. If it is,
  this command lists the paths of all the jar files used by Hadoop, which should tell you the
  location of the Hadoop home directory.
• If you installed using a .deb or .rpm package, you can look in /usr/lib/hadoop, as this is
  often the location where these packages install Hadoop.
5. Copy the file hadoop-vertica.jar from the directory where you unzipped the connector
archive to the lib subdirectory in the Hadoop home directory.
6. Copy the HP Vertica JDBC driver file (vertica-jdbc-x.x.x.jar) to the lib subdirectory in
the Hadoop home directory ($HADOOP_HOME/lib).
7. Edit the $HADOOP_HOME/conf/hadoop-env.sh file, and find the lines:
# Extra Java CLASSPATH elements.  Optional.
# export HADOOP_CLASSPATH=
Uncomment the export line by removing the hash character (#), and add the absolute path of
the JDBC driver file you copied in the previous step. For example:
export HADOOP_CLASSPATH=$HADOOP_HOME/lib/vertica-jdbc-x.x.x.jar
This environment variable ensures that Hadoop can find the HP Vertica JDBC driver.
8. Also in the $HADOOP_HOME/conf/hadoop-env.sh file, ensure that the JAVA_HOME environment
variable is set to your Java installation.
9. If you want your application written in Pig to be able to access HP Vertica, you need to:
a. Locate the Pig home directory. Often, this directory is in the same parent directory as the
Hadoop home directory.
b. Copy the file named pig-vertica.jar from the directory where you unpacked the
connector .zip file to the lib subdirectory in the Pig home directory.
c. Copy the HP Vertica JDBC driver file (vertica-jdbc-x.x.x.jar) to the lib subdirectory
in the Pig home directory.
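The copy steps above can be scripted for repeatability. The following is a minimal sketch that
assumes HADOOP_HOME and PIG_HOME are set and that the connector archive was unzipped into
/tmp/vertica-connector (a hypothetical path):

# Hypothetical unpack location; adjust for your own nodes.
CONNECTOR_DIR=/tmp/vertica-connector
# Steps 5 and 6: copy the connector and the JDBC driver into Hadoop's lib.
cp $CONNECTOR_DIR/hadoop-vertica.jar $HADOOP_HOME/lib/
cp $CONNECTOR_DIR/vertica-jdbc-x.x.x.jar $HADOOP_HOME/lib/
# Step 9: copy the Pig connector and the JDBC driver into Pig's lib.
cp $CONNECTOR_DIR/pig-vertica.jar $PIG_HOME/lib/
cp $CONNECTOR_DIR/vertica-jdbc-x.x.x.jar $PIG_HOME/lib/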
Give the VerticaInputFormat class a query to be used to extract data from HP Vertica.
Selecting VerticaInputFormat
The first step to reading HP Vertica data from within a Hadoop job is to set its input format. You
usually set the input format within the run() method in your Hadoop application's class. To set up
the input format, pass the job.setInputFormatClass method the VerticaInputFormat.class, as
follows:
public int run(String[] args) throws Exception {
// Set up the configuration and job objects
Configuration conf = getConf();
Job job = new Job(conf);
// Set the input format class so the map method receives
// VerticaRecord objects as input.
job.setInputFormatClass(VerticaInputFormat.class);
Setting the input to the VerticaInputFormat class means that the map method will get
VerticaRecord objects as its input.
A parameterized query along with a second query that retrieves the parameter values for the first
query from HP Vertica.
The query you supply must have an ORDER BY clause, since the HP Vertica Connector for
Hadoop Map Reduce uses it to figure out how to segment the query results between the Hadoop
nodes. When it gets a simple query, the connector calculates limits and offsets to be sent to each
node in the Hadoop cluster, so they each retrieve a portion of the query results to process.
Having Hadoop use a simple query to retrieve data from HP Vertica is the least efficient method,
since the connector needs to perform extra processing to determine how the data should be
segmented across the Hadoop nodes.
The HP Vertica Connector for Hadoop Map Reduce tries to evenly distribute the query and
parameters among the nodes in the Hadoop cluster. If the number of parameters is not a multiple of
the number of nodes in the cluster, some nodes will get more parameters to process than others.
Once the connector divides up the parameters among the Hadoop nodes, each node connects to a
host in the HP Vertica database and executes the query, substituting in the parameter values it
received.
This format is useful when you have a discrete set of parameters that will not change over time.
However, it is inflexible because any change to the parameter list requires you to recompile your
Hadoop job. An added limitation is that the query can contain just a single parameter, because the
setInput method only accepts a single parameter list. The more flexible way to use parameterized
queries is to use a collection to contain the parameters.
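For example, the following call (which also appears in the complete example later in this guide)
supplies a parameterized query together with three discrete parameter values:

// Each value is substituted into the ? placeholder for a separate
// execution of the query.
VerticaInputFormat.setInput(job, "select * from allTypes where key = ?",
"1", "2", "3");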
Each object in the collection is an ArrayList whose values are substituted into the query's
parameters in the order they appear in the statement. Each ArrayList object supplies one set of
parameter values, and can contain values for each parameter in the query.
The following example demonstrates using a collection to pass the parameter values for a query
containing two parameters. The collection object passed to setInput is an instance of the HashSet
class. This object contains four ArrayList objects added within the for loop. This example just
adds dummy values (the loop counter and the string "FOUR"). In your own application, you usually
calculate parameter values in some manner before adding them to the collection.
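A minimal sketch of the call described above, assuming a setInput overload that accepts a
Collection of parameter lists (the table and column names are illustrative):

// Build a collection of parameter sets. Each ArrayList supplies the
// values for one execution of the two-parameter query.
Collection<List<Object>> params = new HashSet<List<Object>>();
for (int i = 1; i <= 4; i++) {
ArrayList<Object> param = new ArrayList<Object>();
param.add(i); // Dummy value: the loop counter.
param.add("FOUR"); // Dummy value: a string constant.
params.add(param);
}
VerticaInputFormat.setInput(job,
"select * from allTypes where key = ? and varcharcol = ?", params);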
Note: If your parameter values are stored in HP Vertica, you should specify the parameters
using a query instead of a collection. See Using a Query to Retrieve Parameters for a
Parameterized Query for details.
In addition to considering the number of parameters in the query, you should make the parameter
values yield roughly the same number of results. Ensuring each parameter yields the same number
of results helps prevent a single node in the Hadoop cluster from becoming a bottleneck by having
to process more data than the other nodes in the cluster.
When it receives a query for parameters, the connector runs the query itself, then groups the results
together to send out to the Hadoop nodes, along with the parameterized query. The Hadoop nodes
then run the parameterized query using the set of parameter values sent to them by the connector.
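A sketch of this usage, assuming a setInput overload whose second string is itself a query that
retrieves the parameter values (the table and column names are illustrative):

// The second query returns one parameter value per result row. The
// connector runs it, then distributes the values among the Hadoop nodes.
VerticaInputFormat.setInput(job,
"select * from allTypes where key = ?",
"select distinct key from allTypes");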
• get retrieves a single value, either by index value or by name, from the row sent to the map
  method.
• getOrdinalPosition takes a string containing a column name and returns the column's
  number.
• getType returns the data type of a column in the row specified by index value or by name. This
  method is useful if you are unsure of the data types of the columns returned by the query. The
  types are stored as integer values defined by the java.sql.Types class.
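For example, inside a map method you might use these accessors as follows (a sketch; the
column name varcharcol is hypothetical):

// Retrieve a value by index and by name.
Object first = value.get(0);
Object named = value.get("varcharcol");
// Look up a column's position, then check its data type.
int col = value.getOrdinalPosition("varcharcol");
if (value.getType(col) == java.sql.Types.VARCHAR) {
// The column holds string data.
}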
The following example shows a Mapper class and map method that accepts VerticaRecord
objects. In this example, no real work is done. Instead two values are selected as the key and value
to be passed on to the reducer.
public static class Map extends
Mapper<LongWritable, VerticaRecord, Text, DoubleWritable> {
// This mapper accepts VerticaRecords as input.
public void map(LongWritable key, VerticaRecord value, Context context)
throws IOException, InterruptedException {
// No real work is done here. Two of the row's values are simply
// selected as the key and value to pass on to the reducer. The
// column indexes used are illustrative.
if (value.get(3) != null && value.get(0) != null) {
context.write(new Text(value.get(3).toString()),
new DoubleWritable(((Number) value.get(0)).doubleValue()));
}
}
}
If your Hadoop job has a reduce stage, all of the map method output is managed by Hadoop. It is not
stored or manipulated in any way by HP Vertica. If your Hadoop job does not have a reduce stage,
and needs to store its output into HP Vertica, your map method must output its keys as Text objects
and values as VerticaRecord objects.
• Set the details of the HP Vertica table where you want to store your data in the
  VerticaOutputFormat class.
• Create a Reduce class that adds data to a VerticaRecord object and calls its write method to
  store the data.
The call to setOutputValueClass tells Hadoop that the output of the Reduce.reduce method is a
VerticaRecord class object. This object represents a single row of an HP Vertica database table.
You tell your Hadoop job to send the data to the connector by setting the output format class to
VerticaOutputFormat.
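You usually make these calls in your Hadoop class's run method, alongside the rest of the job
setup (this code also appears in the complete example later in this guide):

// Tell Hadoop that the reducer emits VerticaRecord values, and that
// the connector's output format stores them in the database.
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(VerticaRecord.class);
job.setOutputFormatClass(VerticaOutputFormat.class);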
VerticaOutputFormat.setOutput(jobObject, tableName, truncate,
["columnName1 dataType1" [, "columnNameN dataTypeN"]...]);

jobObject
The Hadoop job object for your application.
tableName
The name of the table to store Hadoop's output. If this table does not
exist, the HP Vertica Connector for Hadoop Map Reduce
automatically creates it. The name can be a full
database.schema.table reference.
truncate
Whether to delete the contents of tableName before storing the
Hadoop job's output, if the table already exists.
"columnName1 dataType1"
A string containing the name and data type of a column in the output
table.
The first two parameters are required. You can add as many column definitions as you need in your
output table.
You usually call the setOutput method in your Hadoop class's run method, where all other setup
work is done. The following example sets up an output table named mrtarget that contains 8
columns, each containing a different data type:
// Sets the output format for storing data in Vertica. It defines the
// table where data is stored, and the columns it will be stored in.
VerticaOutputFormat.setOutput(job, "mrtarget", true, "a int",
"b boolean", "c char(1)", "d date", "f float", "t timestamp",
"v varchar", "z varbinary");
If the truncate parameter is set to true for the method call, and the target table already exists in the
HP Vertica database, the connector deletes the table contents before storing the Hadoop job's output.
Note: If the table already exists in the database, and the truncate parameter is set
to false, the HP Vertica Connector for Hadoop Map Reduce adds new application output to
the existing table. However, the connector does not verify that the column definitions in the
existing table match those defined in the setOutput method call. If the new application output
values cannot be converted to the existing column values, your Hadoop application can throw
casting exceptions.
VerticaRecord.set(column, value);

column
The column to store the value in. This is either an Integer (the column number) or a
String (the column name, as defined in the table definition).

Note: The set method throws an exception if you pass it the name of a column that does
not exist. You should always use a try/catch block around any set method call that uses
a column name.

value
The value to store in the column. The data type of this value must match the definition of
the column in the table.
Note: If you do not have the set method validate that the data types of the value and the
column match, the HP Vertica Connector for Hadoop Map Reduce throws a
ClassCastException if it finds a mismatch when it tries to commit the data to the database.
This exception causes a rollback of the entire result. By having the set method validate the
data type of the value, you can catch and resolve the exception before it causes a rollback.
In addition to the set method, you can also use the setFromString method to have the HP Vertica
Connector for Hadoop Map Reduce convert the value from String to the proper data type for the
column:
VerticaRecord.setFromString(column, "value");

column
The column to store the value in, either an Integer (the column number) or a String
(the column name, as defined in the table definition).
value
A String containing the value to store in the column. If the String cannot be converted
to the correct data type to be stored in the column, setFromString throws an exception
(ParseException for date values, NumberFormatException for numeric values).
Your reduce method must output a value for every column in the output table. If you want a column
to have a null value you must explicitly set it.
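For example (the column index here is arbitrary):

// Explicitly store a null in column 6 rather than leaving it unset.
record.set(6, null);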
After it populates the VerticaRecord object, your reduce method calls the Context.write
method, passing it the name of the table to store the data in as the key, and the VerticaRecord
object as the value.
The following example shows a simple Reduce class that stores data into HP Vertica. To make the
example as simple as possible, the code doesn't actually process the input it receives, and instead
just writes dummy data to the database. In your own application, you would process the key and
values into data that you then store in the VerticaRecord object.
public static class Reduce extends
Reducer<Text, DoubleWritable, Text, VerticaRecord> {
// Holds the record that the reducer writes its values to.
VerticaRecord record = null;
// A sketch of the setup method: it instantiates the VerticaRecord
// using the column definitions stored in the job configuration.
public void setup(Context context) throws IOException,
InterruptedException {
super.setup(context);
try {
record = new VerticaRecord(context.getConfiguration());
} catch (Exception e) {
throw new IOException(e);
}
}
// The reduce method populates the record and writes it out.
public void reduce(Text key, Iterable<DoubleWritable> values,
Context context) throws IOException, InterruptedException {
// The first column, number 0, contains an integer value.
try {
record.set(0, 125);
} catch (Exception e) {
// Handle the improper data type here.
e.printStackTrace();
}
// You can also set column values by name rather than by column
// number. However, this requires a try/catch since specifying a
// non-existent column name will throw an exception.
try {
// The second column, named "b", contains a Boolean value.
record.set("b", true);
} catch (Exception e) {
// Handle an improper column name here.
e.printStackTrace();
}
// Column 2 stores a single char value.
record.set(2, 'c');
// Columns 3 through 7 are populated the same way; see the complete
// example listing below. Once every column is set, write the record,
// using the name of the target table as the key.
context.write(new Text("mrtarget"), record);
}
}
The connector needs the connection information for your HP Vertica database, such as the names
of the hosts in the cluster, the name of the database, and the user name. The common parameters
for accessing the database appear in the following table. Usually, you will only need the basic
parameters listed in this table in order to start your Hadoop application.
Parameter                  Description                                  Required  Default
mapred.vertica.hostnames   A comma-separated list of the names or IP    Yes       none
                           addresses of the hosts in the HP Vertica
                           cluster.
mapred.vertica.port        The port number of the HP Vertica database.  No        5433
mapred.vertica.database    The name of the database to connect to.      Yes       none
mapred.vertica.username    The user name to use when connecting to the  Yes       none
                           database.
mapred.vertica.password    The password to use when connecting to the   No        empty
                           database.
You pass the parameters to the connector using the -D command line switch in the command you
use to start your Hadoop application. For example:
hadoop jar myHadoopApp.jar com.myorg.hadoop.myHadoopApp \
-libjars $HADOOP_HOME/lib/hadoop-vertica.jar \
-Dmapred.vertica.hostnames=Vertica01,Vertica02,Vertica03,Vertica04 \
-Dmapred.vertica.port=5433 -Dmapred.vertica.username=exampleuser \
-Dmapred.vertica.password=password123 -Dmapred.vertica.database=ExampleDB
The following table lists the parameters that you use to supply your Hadoop application with the
connection information for the output database (the one it writes its data to). None of these
parameters is required. If you do not assign a value to one of these output parameters, it inherits its
value from the input database parameters.
Parameter                         Description                            Default
mapred.vertica.hostnames.output   The hosts in the output database       Input database
                                  cluster.                               hostnames
mapred.vertica.port.output        The port number of the output          5433
                                  database.
mapred.vertica.database.output    The name of the output database.       Input database name
mapred.vertica.username.output    The user name to use when connecting   Input database user
                                  to the output database.                name
mapred.vertica.password.output    The password to use when connecting    Input database
                                  to the output database.                password
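For example, a job that reads from one database and writes to another might add output overrides
to the command line (the host and database names here are hypothetical):

hadoop jar myHadoopApp.jar com.myorg.hadoop.myHadoopApp \
-libjars $HADOOP_HOME/lib/hadoop-vertica.jar \
-Dmapred.vertica.hostnames=InputHost01,InputHost02 \
-Dmapred.vertica.database=InputDB \
-Dmapred.vertica.username=exampleuser \
-Dmapred.vertica.hostnames.output=OutputHost01,OutputHost02 \
-Dmapred.vertica.database.output=OutputDB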
import java.io.IOException;
import java.util.Calendar;
import java.util.Iterator;
import java.util.List;
import java.math.BigDecimal;
import java.sql.Date;
import java.sql.Timestamp;
// Needed when using the setFromString method, which throws this exception.
import java.text.ParseException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import com.vertica.hadoop.VerticaConfiguration;
import com.vertica.hadoop.VerticaInputFormat;
import com.vertica.hadoop.VerticaOutputFormat;
import com.vertica.hadoop.VerticaRecord;
// This is the class that contains the entire Hadoop example.
public class VerticaExample extends Configured implements Tool {
public static class Map extends
Mapper<LongWritable, VerticaRecord, Text, DoubleWritable> {
// This mapper accepts VerticaRecords as input.
public void map(LongWritable key, VerticaRecord value, Context context)
throws IOException, InterruptedException {
// No real work is done here. Two of the row's values are simply
// selected as the key and value to pass on to the reducer. The
// column indexes used are illustrative.
if (value.get(3) != null && value.get(0) != null) {
context.write(new Text(value.get(3).toString()),
new DoubleWritable(((Number) value.get(0)).doubleValue()));
}
}
}
public static class Reduce extends
Reducer<Text, DoubleWritable, Text, VerticaRecord> {
// Holds the record that the reducer writes its values to.
VerticaRecord record = null;
// A sketch of the setup method: it instantiates the VerticaRecord
// using the column definitions stored in the job configuration.
public void setup(Context context) throws IOException,
InterruptedException {
super.setup(context);
try {
record = new VerticaRecord(context.getConfiguration());
} catch (Exception e) {
throw new IOException(e);
}
}
// The reduce method populates the record and writes it out.
public void reduce(Text key, Iterable<DoubleWritable> values,
Context context) throws IOException, InterruptedException {
// The first column, number 0, contains an integer value.
try {
record.set(0, 125);
} catch (Exception e) {
// Handle the improper data type here.
e.printStackTrace();
}
// You can also set column values by name rather than by column
// number. However, this requires a try/catch since specifying a
// non-existent column name will throw an exception.
try {
// The second column, named "b", contains a Boolean value.
record.set("b", true);
} catch (Exception e) {
// Handle an improper column name here.
e.printStackTrace();
}
// Column 2 stores a single char value.
record.set(2, 'c');
// Column 3 stores a date value.
record.set(3, new java.sql.Date(
Calendar.getInstance().getTimeInMillis()));
// Column 4 stores a float. The setFromString method converts the
// String value to the column's data type.
try {
record.setFromString(4, "234.567");
} catch (ParseException e) {
// Thrown if the string cannot be parsed into the data type
// to be stored in the column.
e.printStackTrace();
}
// Column 5 stores a timestamp
record.set(5, new java.sql.Timestamp(
Calendar.getInstance().getTimeInMillis()));
// Column 6 stores a varchar
record.set(6, "example string");
// Column 7 stores a varbinary
record.set(7, new byte[10]);
// Once the columns are populated, write the record to store
// the row into the database.
context.write(new Text("mrtarget"), record);
}
}
@Override
public int run(String[] args) throws Exception {
// Set up the configuration and job objects
Configuration conf = getConf();
Job job = new Job(conf);
conf = job.getConfiguration();
conf.set("mapreduce.job.tracker", "local");
job.setJobName("vertica test");
// Set the input format to retrieve data from
// Vertica.
job.setInputFormatClass(VerticaInputFormat.class);
// Set the output format of the mapper. This is the interim
// data format passed to the reducer. Here, we will pass in a
// Double. The interim data is not processed by Vertica in any
// way.
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(DoubleWritable.class);
// Set the output format of the Hadoop application. It will
// output VerticaRecords that will be stored in the
// database.
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(VerticaRecord.class);
job.setOutputFormatClass(VerticaOutputFormat.class);
job.setJarByClass(VerticaExample.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
// Sets the output format for storing data in Vertica. It defines the
// table where data is stored, and the columns it will be stored in.
VerticaOutputFormat.setOutput(job, "mrtarget", true, "a int",
"b boolean", "c char(1)", "d date", "f float", "t timestamp",
"v varchar", "z varbinary");
// Sets the query to use to get the data from the Vertica database.
// Query using a list of parameters.
VerticaInputFormat.setInput(job, "select * from allTypes where key = ?",
"1", "2", "3");
job.waitForCompletion(true);
return 0;
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new VerticaExample(),
args);
System.exit(res);
}
}
Note: If your node does not have Perl installed, you can run this script on a system that
does have Perl installed, then copy the datasource file to a database node.
5. Run the following query to load data from the datasource file into the table:
COPY allTypes COLUMN OPTION (varbincol FORMAT 'hex', bincol FORMAT 'hex')
FROM '/path-to-datasource/datasource' DIRECT;
Replace path-to-datasource with the absolute path to the datasource file located in the
same directory where you ran MakeAllTypes.pl.
3. If it is not already set, set the environment variable HADOOP_HOME to the Hadoop home
directory:
export HADOOP_HOME=path_to_Hadoop_home
If you installed Hadoop using an .rpm or .deb package, Hadoop is usually installed in
/usr/lib/hadoop:
export HADOOP_HOME=/usr/lib/hadoop
4. Save the example source code to a file named VerticaExample.java on your Hadoop node.
5. In the same directory where you saved VerticaExample.java, create a directory named
classes. On Linux, the command is:
mkdir classes
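6. Compile and package the example. This is a minimal sketch that assumes the Hadoop core jar
is reachable as $HADOOP_HOME/hadoop-core.jar; adjust the classpath for your installation:

# Compile the example, putting the class files into the classes directory.
javac -classpath \
$HADOOP_HOME/hadoop-core.jar:$HADOOP_HOME/lib/hadoop-vertica.jar \
-d classes VerticaExample.java
# Package the compiled classes into a jar file for Hadoop to run.
jar -cvf hadoop-vertica-example.jar -C classes .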
Note: If you receive errors about missing Hadoop classes, check the name of the
hadoop-core.jar file. Most Hadoop installers (including Cloudera's) create a symbolic
link named hadoop-core.jar to a version-specific .jar file (such as
hadoop-core-0.20.203.0.jar). If your Hadoop installation did not create this link, you will
have to supply the .jar file name with the version number.
When the compilation finishes, you will have a file named hadoop-vertica-example.jar in the
same directory as the VerticaExample.java file. This is the file you will have Hadoop run.
hadoop jar hadoop-vertica-example.jar VerticaExample \
-Dmapred.vertica.hostnames=VerticaHostNames \
-Dmapred.vertica.port=portNumber \
-Dmapred.vertica.username=userName \
-Dmapred.vertica.password=dbPassword \
-Dmapred.vertica.database=databaseName
This command tells Hadoop to run your application's .jar file, and supplies the parameters needed
for your application to connect to your HP Vertica database. Fill in your own values for the
hostnames, port, user name, password, and database name for your HP Vertica database.
After entering the command line, you will see output from Hadoop as it processes data that looks
similar to the following:
12/01/11 10:41:19 INFO mapred.JobClient: Running job: job_201201101146_0005
12/01/11 10:41:20 INFO mapred.JobClient:  map 0% reduce 0%
12/01/11 10:41:36 INFO mapred.JobClient:  map 33% reduce 0%
12/01/11 10:41:39 INFO mapred.JobClient:  map 66% reduce 0%
12/01/11 10:41:42 INFO mapred.JobClient:  map 100% reduce 0%
12/01/11 10:41:45 INFO mapred.JobClient:  map 100% reduce 22%
12/01/11 10:41:51 INFO mapred.JobClient:  map 100% reduce 100%
12/01/11 10:41:56 INFO mapred.JobClient: Job complete: job_201201101146_0005
12/01/11 10:41:56 INFO mapred.JobClient: Counters: 23
12/01/11 10:41:56 INFO mapred.JobClient:   Job Counters
12/01/11 10:41:56 INFO mapred.JobClient:     Launched reduce tasks=1
12/01/11 10:41:56 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=21545
12/01/11 10:41:56 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/01/11 10:41:56 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/01/11 10:41:56 INFO mapred.JobClient:     Launched map tasks=3
12/01/11 10:41:56 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=13851
12/01/11 10:41:56 INFO mapred.JobClient:   File Output Format Counters
12/01/11 10:41:56 INFO mapred.JobClient:     Bytes Written=0
12/01/11 10:41:56 INFO mapred.JobClient:   FileSystemCounters
12/01/11 10:41:56 INFO mapred.JobClient:     FILE_BYTES_READ=69
12/01/11 10:41:56 INFO mapred.JobClient:     HDFS_BYTES_READ=318
12/01/11 10:41:56 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=89367
12/01/11 10:41:56 INFO mapred.JobClient:   File Input Format Counters
12/01/11 10:41:56 INFO mapred.JobClient:     Bytes Read=0
12/01/11 10:41:56 INFO mapred.JobClient:   Map-Reduce Framework
12/01/11 10:41:56 INFO mapred.JobClient:     Reduce input groups=1
12/01/11 10:41:56 INFO mapred.JobClient:     Map output materialized bytes=81
12/01/11 10:41:56 INFO mapred.JobClient:     Combine output records=0
12/01/11 10:41:56 INFO mapred.JobClient:     Map input records=3
12/01/11 10:41:56 INFO mapred.JobClient:     Reduce shuffle bytes=54
12/01/11 10:41:56 INFO mapred.JobClient:     Reduce output records=1
12/01/11 10:41:56 INFO mapred.JobClient:     Spilled Records=6
12/01/11 10:41:56 INFO mapred.JobClient:     Map output bytes=57
12/01/11 10:41:56 INFO mapred.JobClient:     Combine input records=0
12/01/11 10:41:56 INFO mapred.JobClient:     Map output records=3
12/01/11 10:41:56 INFO mapred.JobClient:     SPLIT_RAW_BYTES=318
12/01/11 10:41:56 INFO mapred.JobClient:     Reduce input records=3
Note: The version of the example supplied in the Hadoop Connector download package will
produce additional output beyond what is shown here.

You can verify that the job stored its output by querying the mrtarget table:

=> SELECT * FROM mrtarget;
  a  | b | c |     d      |    f    |            t            |       v        |                    z
-----+---+---+------------+---------+-------------------------+----------------+------------------------------------------
 125 | t | c | 2012-01-11 | 234.567 | 2012-01-11 10:41:48.837 | example string | \000\000\000\000\000\000\000\000\000\000
(1 row)
Note: The VerticaStreamingInput class is within the deprecated namespace because the
current version of Hadoop (as of 0.20.1) has not defined a current API for streaming. Instead,
the streaming classes conform to the Hadoop version 0.18 API.
In addition to the standard command-line parameters that tell Hadoop how to access your
database, there are additional streaming-specific parameters you need to use that supply Hadoop
with the query it should use to extract data from HP Vertica and other query-related options.
Parameter                        Description                          Required            Default
mapred.vertica.input.query       The query to use to extract data     Yes                 none
                                 from HP Vertica.
mapred.vertica.input.paramquery  A query to run against HP Vertica    If the query has a  none
                                 to retrieve the parameter values     parameter and no
                                 for the input query.                 discrete parameters
                                                                      are supplied
mapred.vertica.query.params      A discrete list of parameter         If the query has a  none
                                 values for the input query.          parameter and no
                                                                      parameter query is
                                                                      supplied
mapred.vertica.input.delimiter   The character used to separate       No                  0xa
                                 column values in each record.
mapred.vertica.input.terminator  The character used to mark the end   No                  0xb
                                 of each record.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
-libjars $HADOOP_HOME/lib/hadoop-vertica.jar \
-Dmapred.vertica.hostnames=VerticaHost01,VerticaHost02,... \
-Dmapred.vertica.database=ExampleDB \
-Dmapred.vertica.username=ExampleUser \
-Dmapred.vertica.password=password123 \
-Dmapred.vertica.port=5433 \
-Dmapred.vertica.input.query="SELECT key, intcol, floatcol, varcharcol FROM allTypes
ORDER BY key" \
-Dmapred.vertica.input.delimiter=, \
-Dmapred.map.tasks=1 \
-inputformat com.vertica.hadoop.deprecated.VerticaStreamingInput \
-input /tmp/input -output /tmp/output -reducer /bin/cat -mapper /bin/cat
The results of this command are saved in the /tmp/output directory on your HDFS filesystem. On
a four-node Hadoop cluster, the results would be:

# $HADOOP_HOME/bin/hadoop dfs -ls /tmp/output
Found 5 items
drwxr-xr-x - release supergroup ... /tmp/output/_logs
-rw-r--r-- ... /tmp/output/part-00000
-rw-r--r-- ... /tmp/output/part-00001
-rw-r--r-- ... /tmp/output/part-00002
-rw-r--r-- ... /tmp/output/part-00003

Each part file contains rows of comma-separated values extracted from the allTypes table,
such as:

1015,15,15.1515,FIFTEEN
Notes

• Even though the input is coming from HP Vertica, you need to supply the -input parameter to
  Hadoop for it to process the streaming job.
• The -Dmapred.map.tasks=1 parameter prevents multiple Hadoop nodes from reading the same
  data from the database, which would result in Hadoop processing multiple copies of the data.
Parameter                          Description                        Required        Default
mapred.vertica.output.table.name   The name of the table in which     Yes             none
                                   to store Hadoop's output.
mapred.vertica.output.table.def    The definition of the output       If the table    none
                                   table's columns.                   does not
                                                                      already exist
                                                                      in the
                                                                      database
mapred.vertica.output.table.drop   Whether to drop the table, if it   No              false
                                   already exists, before storing
                                   the data.
mapred.vertica.output.delimiter    The character used to separate     No              0x7 (ASCII
                                   column values.                                     bell
                                                                                      character)
mapred.vertica.output.terminator   The character used to mark the     No              0x8 (ASCII
                                   end of each record.                                backspace)
The following example demonstrates reading two columns of data from an HP Vertica
database table named allTypes and writing them back to the same database in a table named
hadoopout. The command provides the definition for the table, so you do not have to manually
create the table beforehand.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
-libjars $HADOOP_HOME/lib/hadoop-vertica.jar \
-Dmapred.vertica.output.table.name=hadoopout \
-Dmapred.vertica.output.table.def="intcol integer, varcharcol varchar" \
-Dmapred.vertica.output.table.drop=true \
-Dmapred.vertica.hostnames=VerticaHost01,VerticaHost02,VerticaHost03 \
-Dmapred.vertica.database=ExampleDB \
-Dmapred.vertica.username=ExampleUser \
-Dmapred.vertica.password=password123 \
-Dmapred.vertica.port=5433 \
-Dmapred.vertica.input.query="SELECT intcol, varcharcol FROM allTypes ORDER BY key" \
-Dmapred.vertica.input.delimiter=, \
-Dmapred.vertica.output.delimiter=, \
-Dmapred.vertica.input.terminator=0x0a \
-Dmapred.vertica.output.terminator=0x0a \
-inputformat com.vertica.hadoop.deprecated.VerticaStreamingInput \
-outputformat com.vertica.hadoop.deprecated.VerticaStreamingOutput \
-input /tmp/input \
-output /tmp/output \
-reducer /bin/cat \
-mapper /bin/cat
After running this command, you can view the result by querying your database:
=> SELECT * FROM hadoopout;
 intcol | varcharcol
--------+------------
      1 | ONE
      5 | ONE
      9 | ONE
      2 | ONE
      6 | ONE
      0 | ONE
      4 | ONE
      8 | ONE
      3 | ONE
      7 | ONE
(10 rows)
HP Vertica can load data through several different interfaces, and streaming is best used for
smaller, one-time loads. If you need to load large amounts of data on a regular basis, you should
create a standard Hadoop map/reduce job in Java or a script in Pig.

For example, suppose you have a text file in HDFS that you want to load. It contains values
delimited by pipe characters (|), and each line of the file is terminated by a carriage return:
# $HADOOP_HOME/bin/hadoop dfs -cat /tmp/textdata.txt
1|1.0|ONE
2|2.0|TWO
3|3.0|THREE
In this case, the line delimiter poses a problem. You can easily include the column delimiter in the
Hadoop command line arguments. However, it is hard to specify a carriage return in the Hadoop
command line. To get around this issue, you can write a mapper script to strip the carriage return
and replace it with some other character that is easy to enter in the command line and also does not
occur in the data.
Below is an example of a mapper script written in Python. It performs two tasks:

• Strips the carriage returns from the input text and terminates each line with a tilde (~).
• Adds a key value (the string "streaming") followed by a tab character at the start of each line of
  the text file. The mapper script needs to do this because the streaming job that reads text files
  skips the reducer stage. The reducer isn't necessary, since all of the data being read from the
  text file should be stored in the HP Vertica tables. However, the VerticaStreamingOutput class
  requires key and value pairs, so the mapper script adds the key.
#!/usr/bin/python
import sys
for line in sys.stdin.readlines():
# Get rid of carriage returns.
# CR is used as the record terminator by Streaming.jar
line = line.strip()
# Add a key. The key value can be anything.
# The convention is to use the name of the
# target table, as shown here.
sys.stdout.write("streaming\t%s~\n" % line)
The Hadoop command to stream text files from the HDFS into HP Vertica using the above mapper
script appears below.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
-libjars $HADOOP_HOME/lib/hadoop-vertica.jar \
-Dmapred.reduce.tasks=0 \
-Dmapred.vertica.output.table.name=streaming \
-Dmapred.vertica.output.table.def="intcol integer, floatcol float, varcharcol
varchar" \
-Dmapred.vertica.hostnames=VerticaHost01,VerticaHost02,VerticaHost03 \
-Dmapred.vertica.port=5433 \
-Dmapred.vertica.username=ExampleUser \
-Dmapred.vertica.password=password123 \
-Dmapred.vertica.database=ExampleDB \
-Dmapred.vertica.output.delimiter="|" \
-Dmapred.vertica.output.terminator="~" \
-input /tmp/textdata.txt \
-output output \
-mapper "python path-to-script/mapper.py" \
-outputformat com.vertica.hadoop.deprecated.VerticaStreamingOutput
Notes

• The -Dmapred.reduce.tasks=0 parameter disables the streaming job's reducer stage. It does
  not need a reducer since the mapper script processes the data into the format that the
  VerticaStreamingOutput class expects.
• Even though the VerticaStreamingOutput class is handling the output from the mapper, you
  need to supply a valid output directory to the Hadoop command.
The result of running the command is a new table in the HP Vertica database:
=> SELECT * FROM streaming;
 intcol | floatcol | varcharcol
--------+----------+------------
      3 |        3 | THREE
      1 |        1 | ONE
      2 |        2 | TWO
(3 rows)
REGISTER 'path-to-pig-home/lib/vertica-jdbc-7.1.x.jar';
REGISTER 'path-to-pig-home/lib/pig-vertica.jar';
These commands ensure that Pig can locate the HP Vertica JDBC classes, as well as the
interface for the connector.
database
The name of the database to connect to.
port
The port number on which the database listens.
username
The username to use when connecting to the database.
password
The password to use when connecting to the database. This is the only optional
parameter. If not present, the connector assumes the password is empty.
The following Pig Latin command extracts all of the data from the table named allTypes using a
simple query:
A = LOAD 'sql://{SELECT * FROM allTypes ORDER BY key}' USING
com.vertica.pig.VerticaLoader('Vertica01,Vertica02,Vertica03',
'ExampleDB','5433','ExampleUser','password123');
This example uses a parameter and supplies a discrete list of parameter values:
A = LOAD 'sql://{SELECT * FROM allTypes WHERE key = ?};{1,2,3}' USING
com.vertica.pig.VerticaLoader('Vertica01,Vertica02,Vertica03',
'ExampleDB','5433','ExampleUser','password123');
This final example demonstrates using a second query to retrieve parameters from the HP Vertica
database.
A = LOAD 'sql://{SELECT * FROM allTypes WHERE key = ?};sql://{SELECT DISTINCT key FROM
allTypes}'
USING com.vertica.pig.VerticaLoader('Vertica01,Vertica02,Vertica03','ExampleDB',
'5433','ExampleUser','password123');
The following example demonstrates saving a relation into a table named hadoopOut, which must
already exist in the database:
STORE A INTO '{hadoopOut}' USING
com.vertica.pig.VerticaStorer('Vertica01,Vertica02,Vertica03','ExampleDB','5433',
'ExampleUser','password123');
This example shows how you can add a table definition to the table name, so that the table is
created in HP Vertica if it does not already exist:
STORE A INTO '{outTable(a int, b int, c float, d char(10), e varchar, f boolean, g date,
h timestamp, i timestamptz, j time, k timetz, l varbinary, m binary,
n numeric(38,0), o interval)}' USING
com.vertica.pig.VerticaStorer('Vertica01,Vertica02,Vertica03','ExampleDB','5433',
'ExampleUser','password123');
Note: If the table already exists in the database, and the definition that you supply differs from
the table's definition, the table is not dropped and recreated. This may cause data type errors
when data is being loaded.
In addition, the WebHDFS system must be enabled. See your Hadoop distribution's documentation
for instructions on configuring and enabling WebHDFS.
Note: HTTPfs (also known as HOOP) is another method of accessing files stored in an HDFS.
It relies on a separate server process that receives requests for files and retrieves them from
the HDFS. Since it uses a REST API that is compatible with WebHDFS, it could theoretically
work with the connector. However, the connector has not been tested with HTTPfs and HP
does not support using the HP Vertica Connector for HDFS with HTTPfs. In addition, since all
of the files retrieved from HDFS must pass through the HTTPfs server, it is less efficient than
WebHDFS, which lets HP Vertica nodes directly connect to the Hadoop nodes storing the file
blocks.
• The Kerberos server must be accessible from every node in your HP Vertica cluster.
• You must have Kerberos principals (users) that map to Hadoop users. You use these principals
  to authenticate your HP Vertica users with the Hadoop cluster.

Before you can use the HP Vertica Connector for HDFS with Kerberos, you must install the
Kerberos client and libraries on your HP Vertica cluster.
1. Log into your Hadoop cluster and locate a text file to test with. If you do not have a
   suitable file, you can create one with the following command:

echo -e "A|1|2|3\nB|4|5|6" | hadoop fs -put - /tmp/test.txt
2. Log into a host in your HP Vertica database using the database administrator account.
3. If you are using Kerberos authentication, authenticate with the Kerberos server using the
keytab file for a user who is authorized to access the file. For example, to authenticate as a
user named exampleuser@MYCOMPANY.COM, use the command:

$ kinit exampleuser@MYCOMPANY.COM -k -t /path/exampleuser.keytab
Where path is the path to the keytab file you copied over to the node. You do not receive any
message if you authenticate successfully. You can verify that you are authenticated by using
the klist command:
$ klist
Ticket cache: FILE:/tmp/krb5cc_500
Default principal: exampleuser@MYCOMPANY.COM

Valid starting     Expires            Service principal
07/24/13 14:30:19  07/25/13 14:30:19  krbtgt/MYCOMPANY.COM@MYCOMPANY.COM
        renew until 07/24/13 14:30:19
If you are not using Kerberos authentication, run the following command from the Linux
command line:
curl -i -L \
"http://hadoopNameNode:50070/webhdfs/v1/tmp/test.txt?op=OPEN&user.name=hadoopUserName"
Replacing hadoopNameNode with the hostname or IP address of the name node in your
Hadoop cluster, /tmp/test.txt with the path to the file in the Hadoop filesystem you
located in step 1, and hadoopUserName with the user name of a Hadoop user that has read
access to the file.
If successful, the command produces output similar to the following:
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Set-Cookie:
hadoop.auth="u=hadoopUser&p=password&t=simple&e=1344383263490&s=n8YB/CHFg56qNmRQRTqO0IdRMvE="; Version=1; Path=/
Content-Type: application/octet-stream
Content-Length: 16
Date: Tue, 07 Aug 2012 13:47:44 GMT
A|1|2|3
B|4|5|6
If you are using Kerberos authentication, run the following command from the Linux
command line:
curl --negotiate -i -L -u:anyUser
http://hadoopNameNode:50070/webhdfs/v1/tmp/test.txt?op=OPEN
Replace hadoopNameNode with the hostname or IP address of the name node in your
Hadoop cluster, and /tmp/test.txt with the path to the file in the Hadoop filesystem you
located in step 1.
If successful, the command produces output similar to the following:
HTTP/1.1 401 Unauthorized
Content-Type: text/html; charset=utf-8
WWW-Authenticate: Negotiate
Content-Length: 0
Server: Jetty(6.1.26)
HTTP/1.1 307 TEMPORARY_REDIRECT
Content-Type: application/octet-stream
Expires: Thu, 01-Jan-1970 00:00:00 GMT
Set-Cookie: hadoop.auth="u=exampleuser&p=exampleuser@MYCOMPANY.COM&t=kerberos&
e=1375144834763&s=iY52iRvjuuoZ5iYG8G5g12O2Vwo=";Path=/
Location:
http://hadoopnamenode.mycompany.com:1006/webhdfs/v1/user/release/docexample/test.txt?op=OPEN&delegation=JAAHcmVsZWFzZQdyZWxlYXNlAIoBQCrfpdGKAUBO7CnRju3TbBSlID_osB658jfGfRpEt8-u9WHymRJXRUJIREZTIGRlbGVnYXRpb24SMTAuMjAuMTAwLjkxOjUwMDcw&offset=0
Content-Length: 0
Server: Jetty(6.1.26)
HTTP/1.1 200 OK
Content-Type: application/octet-stream
Content-Length: 16
Server: Jetty(6.1.26)
A|1|2|3
B|4|5|6
If the curl command fails, you must review the error messages and resolve any issues before using
the HP Vertica Connector for HDFS with your Hadoop cluster. Some debugging steps include:

• Verify that the Hadoop user you specified exists and has read access to the file you are
  attempting to retrieve.
For example, if you downloaded the Red Hat installation package to the dbadmin home
directory, the command to install the package is:
rpm -Uvh /home/dbadmin/vertica-hdfs-connectors-7.1.x.x86_64.RHEL5.rpm
Once you have installed the connector package on each host, you need to load the connector library
into HP Vertica. See Loading the HDFS User Defined Source for instructions.
Note: You only need to run an installation script once in order to create the User Defined
Source in the HP Vertica catalog. You do not need to run the install script on each node.
The SQL install script loads the HP Vertica Connector for HDFS library and defines the HDFS User
Defined Source named HDFS. The output of successfully running the installation script looks like
this:
                 version
-------------------------------------------
 Vertica Analytic Database v7.1.x
(1 row)
CREATE LIBRARY
CREATE SOURCE FUNCTION
Once the install script finishes running, the connector is ready to use.
A COPY statement using the Hdfs source has the following general form (a sketch; see the
examples below for concrete usage):

=> COPY tableName SOURCE Hdfs(url='WebHDFSFileURL' [, username='username']
   [, low_speed_limit=speed]);

tableName
The name of the table to load the data into.
WebHDFSFileURL
A string containing one or more URLs that identify the file or files to be read. See
below for details. Use commas to separate multiple URLs .
Important: You must replace any commas in the URLs with the escape
sequence %2c. For example, if you are loading a file named doe,john.txt,
change the file's name in the URL to doe%2cjohn.txt.
username
The username of a Hadoop user that has permissions to access the files you
want to copy. If you are using Kerberos, omit this argument.
speed
The minimum data transmission rate, expressed in bytes per second, that the
connector allows (the low_speed_limit parameter). The connector breaks any
connection between the Hadoop and HP Vertica clusters that transmits data slower
than this rate for more than 1 minute. After the connector breaks a connection for
being too slow, it attempts to connect to another node in the Hadoop cluster. This new
connection can supply the data that the broken connection was retrieving. The
connector terminates the COPY statement and returns an error message if:

• The previous transfer attempts from all other Hadoop nodes that have the file
  also closed because they were too slow.
NameNode
The hostname or IP address of the name node in your Hadoop cluster.
Port
The port number on which the WebHDFS service is running. This number is
usually 50070 or 14000, but may be different in your Hadoop installation.
webhdfs/v1/
The protocol being used to retrieve the file. This portion of the URL is always the
same. It tells Hadoop to use version 1 of the WebHDFS API.
HDFSFilePath
The path from the root of the HDFS filesystem to the file or files you want to load.
This path can contain standard Linux wildcards.
Important: Any wildcards you use to specify multiple input files must resolve
to files only. They must not include any directories. For example, if you
specify the path /user/HadoopUser/output/*, and the output directory
contains a subdirectory, the connector returns an error message.
The following example shows how to use the HP Vertica Connector for HDFS to load a single file
named /tmp/test.txt. The Hadoop cluster's name node is named hadoop.
=> COPY testTable SOURCE Hdfs(url='http://hadoop:50070/webhdfs/v1/tmp/test.txt',
username='hadoopUser');
 Rows Loaded
-------------
           2
(1 row)
You specify multiple files to be loaded in your Hdfs function call by:

• Supplying multiple comma-separated URLs in the url parameter of the Hdfs user-defined source
  function call
• Using standard Linux wildcards in the HDFS file path portion of the URL

Loading multiple files through the HP Vertica Connector for HDFS results in an efficient load. The
HP Vertica hosts connect directly to individual nodes in the Hadoop cluster to retrieve files. If
Hadoop has broken files into multiple chunks, the HP Vertica hosts directly connect to the nodes
storing each chunk.
The following example shows how to load all of the files whose filenames start with "part-" located
in the /user/hadoopUser/output directory on the HDFS. If there are at least as many files in this
directory as there are nodes in the HP Vertica cluster, all nodes in the cluster load data from the
HDFS.
=> COPY Customers SOURCE
-> Hdfs(url='http://hadoop:50070/webhdfs/v1/user/hadoopUser/output/part-*',
-> username='hadoopUser');
 Rows Loaded
-------------
       40008
(1 row)
To load data from multiple directories on HDFS at once use multiple comma-separated URLs in the
URL string:
=> COPY Customers SOURCE
-> Hdfs(url='http://hadoop:50070/webhdfs/v1/user/HadoopUser/output/part-*,
-> http://hadoop:50070/webhdfs/v1/user/AnotherUser/part-*',
-> username='hadoopUser');
 Rows Loaded
-------------
       80016
(1 row)
Note: HP Vertica statements must be less than 65,000 characters long. If you supply too
many long URLs in a single statement, you could go over this limit. Normally, you would only
approach this limit if you are automatically generating the COPY statement using a program
or script.
Later, after another Hadoop job adds contents to the output directory, querying the table produces
different results:
=> SELECT * FROM hadoopExample;
   A   | B  | C  | D
-------+----+----+----
 test3 | 10 | 11 | 12
test3 | 13 | 14 | 15
test2 | 6 | 7 | 8
test2 | 9 | 0 | 10
test1 | 1 | 2 | 3
test1 | 3 | 4 | 5
(6 rows)
Note that it is not until you actually query the table that the connector attempts to read the file.
Only then does it return an error.
2. Update the Kerberos configuration file (/etc/krb5.conf) to reflect your site's Kerberos
configuration. The easiest method of doing this is to copy the /etc/krb5.conf file from your
Kerberos Key Distribution Center (KDC) server.
3. Copy the keytab files for the users to a directory on the node (see Preparing Keytab Files for
the HP Vertica Connector for HDFS). The absolute path to these files must be the same on
every node in your HP Vertica cluster.
4. Ensure the keytab files are readable by the database administrator's Linux account (usually
dbadmin). The easiest way to do this is to change ownership of the files to dbadmin:
sudo chown dbadmin *.keytab
1. Start the kadmin utility:

• If you have access to the root account on the Kerberos Key Distribution Center (KDC)
  system, log into it and use the kadmin.local command. (If this command is not in the
  system search path, try /usr/kerberos/sbin/kadmin.local.)
• If you do not have access to the root account on the Kerberos KDC, then you can use the
  kadmin command from a Kerberos client system, as long as you have the password for the
  Kerberos administrator account. When you start kadmin, it will prompt you for the Kerberos
  administrator's password. You may need root privileges on the client system in order to
  run kadmin.
2. Use kadmin's xst (export) command to export the user's credentials to a keytab file:
xst -k username.keytab username@YOURDOMAIN.COM
where username is the name of the Kerberos principal you want to export, and YOURDOMAIN.COM is
your site's Kerberos realm. This command creates a keytab file named username.keytab in
the current directory.
The following example demonstrates exporting a keytab file for a user named
exampleuser@MYCOMPANY.COM using the kadmin command on a Kerberos client system:
$ sudo kadmin
[sudo] password for dbadmin:
Authenticating as principal root/admin@MYCOMPANY.COM with password.
Password for root/admin@MYCOMPANY.COM:
kadmin: xst -k exampleuser.keytab exampleuser@MYCOMPANY.COM
Entry for principal exampleuser@MYCOMPANY.COM with kvno 2, encryption
type des3-cbc-sha1 added to keytab WRFILE:exampleuser.keytab.
Entry for principal exampleuser@MYCOMPANY.COM with kvno 2, encryption
type des-cbc-crc added to keytab WRFILE:exampleuser.keytab.
After exporting the keyfile, you can use the klist command to list the keys stored in the file:
$ sudo klist -k exampleuser.keytab
[sudo] password for dbadmin:
Keytab name: FILE:exampleuser.keytab
KVNO Principal
---- ------------------------------------------------------------------
   2 exampleuser@MYCOMPANY.COM
   2 exampleuser@MYCOMPANY.COM
1. From the Linux command line, start the kadmin utility (kadmin.local if you are logged into the
Kerberos Key Distribution Center). Run the getprinc command for the user:
$ sudo kadmin
[sudo] password for dbadmin:
Authenticating as principal root/admin@MYCOMPANY.COM with password.
Password for root/admin@MYCOMPANY.COM:
kadmin: getprinc exampleuser@MYCOMPANY.COM
Principal: exampleuser@MYCOMPANY.COM
Expiration date: [never]
Last password change: Fri Jul 26 09:40:44 EDT 2013
Password expiration date: [none]
Maximum ticket life: 1 day 00:00:00
Maximum renewable life: 0 days 00:00:00
Last modified: Fri Jul 26 09:40:44 EDT 2013 (root/admin@MYCOMPANY.COM)
Last successful authentication: [never]
Last failed authentication: [never]
Failed password attempts: 0
Number of keys: 2
Key: vno 3, des3-cbc-sha1, no salt
Key: vno 3, des-cbc-crc, no salt
MKey: vno 0
Attributes:
Policy: [none]
In the preceding example, there are two keys stored for the user, both of which are at version
number (vno) 3.
2. To get the version numbers of the keys stored in the keytab file, use the klist command:
$ sudo klist -ek exampleuser.keytab
Keytab name: FILE:exampleuser.keytab
KVNO Principal
---- ----------------------------------------------------------------------
   2 exampleuser@MYCOMPANY.COM (des3-cbc-sha1)
   2 exampleuser@MYCOMPANY.COM (des-cbc-crc)
   3 exampleuser@MYCOMPANY.COM (des3-cbc-sha1)
   3 exampleuser@MYCOMPANY.COM (des-cbc-crc)
The first column in the output lists the key version number. In the preceding example, the
keytab includes both key versions 2 and 3, so the keytab file can be used to authenticate the
user with Kerberos.
To correct this error, verify that all of the nodes in your HP Vertica cluster have the correct version
of the HP Vertica Connector for HDFS package installed. This error can occur if one or more of the
nodes do not have the supporting libraries installed. These libraries may be missing because one of
the nodes was skipped when initially installing the connector package. Another possibility is that
one or more nodes have been added since the connector was installed.
If you encounter this error, troubleshoot the connection between your HP Vertica and Hadoop
clusters. If there are no problems with the network, determine if either your Hadoop cluster or HP
Vertica cluster is overloaded. If the nodes in either cluster are too busy, they may not be able to
maintain the minimum data transfer rate.
If you cannot resolve the issue causing the slow transfer rate, you can lower the minimum
acceptable speed. To do so, set the low_speed_limit parameter for the Hdfs source. The following
example shows how to set low_speed_limit to 524288 to accept transfer rates as low as 512 KB per
second (half the default lower limit).
=> COPY messages SOURCE Hdfs
(url='http://hadoop.example.com:50070/webhdfs/v1/tmp/data.txt',
username='exampleuser', low_speed_limit=524288);
 Rows Loaded
-------------
     9891287
(1 row)
When you lower the low_speed_limit parameter, the COPY statement loading data from
HDFS may take a long time to complete.
You can also increase the low_speed_limit setting if the network between your Hadoop cluster and
HP Vertica cluster is fast. You can choose to increase the lower limit to force COPY statements to
generate an error if they are running more slowly than they normally should, given the speed of the
network.
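For example, the following COPY statement raises the limit to 4 MB per second (a sketch based
on the earlier example; the URL and username are hypothetical):

=> COPY messages SOURCE Hdfs
(url='http://hadoop.example.com:50070/webhdfs/v1/tmp/data.txt',
username='exampleuser', low_speed_limit=4194304);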
Apache's Hive lets you query data stored in a Hadoop Distributed File System (HDFS) the same
way you query data stored in a relational database. Behind the scenes, Hive uses a set of
serializer and deserializer (SerDe) classes to extract data from files stored on the HDFS and
break it into columns and rows. Each SerDe handles data files in a specific format. For example,
one SerDe extracts data from comma-separated data files while another interprets data stored in
JSON format.
Apache HCatalog is a component of the Hadoop ecosystem that makes Hive's metadata
available to other Hadoop components (such as Pig).
WebHCat (formerly known as Templeton) makes HCatalog and Hive data available via a
REST web API. Through it, you can make an HTTP request to retrieve data stored in Hive, as
well as information about the Hive schema.
HP Vertica's HCatalog Connector lets you transparently access data that is available through
WebHCat. You use the connector to define a schema in HP Vertica that corresponds to a Hive
database or schema. When you query data within this schema, the HCatalog Connector
transparently extracts and formats the data from Hadoop into tabular data. The data within this
HCatalog schema appears as if it is native to HP Vertica. You can even perform operations such as
joins between HP Vertica-native tables and HCatalog tables. For more details, see How the
HCatalog Connector Works.
- The HCatalog Connector always reflects the current state of data stored in Hive.
- The HCatalog Connector uses the parallel nature of both HP Vertica and Hadoop to process Hive data. As a result, querying data through the HCatalog Connector is often faster than querying the data directly through Hive.
- Since HP Vertica performs the extraction and parsing of data, the HCatalog Connector does not significantly increase the load on your Hadoop cluster.
- The data you query through the HCatalog Connector can be used as if it were native HP Vertica data. For example, you can execute a query that joins data from a table in an HCatalog schema with a native table.
- Hive's data is stored in flat files in a distributed filesystem, requiring it to be read and deserialized each time it is queried. This deserialization causes Hive's performance to be much slower than HP Vertica's. The HCatalog Connector has to perform the same process as Hive to read the data. Therefore, querying data stored in Hive using the HCatalog Connector is much slower than querying a native HP Vertica table. If you need to perform extensive analysis on data stored in Hive, you should consider loading it into HP Vertica through the HCatalog Connector or the WebHDFS connector. HP Vertica optimization often makes querying data through the HCatalog Connector faster than directly querying it through Hive.
- Hive supports complex data types such as lists, maps, and structs that HP Vertica does not support. Columns containing these data types are converted to a JSON representation of the data type and stored as a VARCHAR. See Data Type Conversions from Hive to HP Vertica.
Note: The HCatalog Connector is read only. It cannot insert data into Hive.
Use the latest available Hive version to write ORC files. (You can still read them with earlier
versions.)
Syntax
In the COPY statement, specify a format of ORC as follows:
COPY tableName FROM path ORC;
In the CREATE EXTERNAL TABLE AS COPY statement, specify a format of ORC as follows:
CREATE EXTERNAL TABLE tableName (columns)
AS COPY FROM path ORC;
If the file resides on the local file system of the node where you are issuing the command, use a
local file path for path. If the file resides elsewhere in HDFS, use the webhdfs:// prefix and then
specify the host name, port, and file path. Use ON ANY NODE for files that are not local to improve
performance.
COPY t FROM 'webhdfs://somehost:port/opt/data/orcfile' ON ANY NODE ORC;
If the file contains data types that HP Vertica does not support, the COPY or CREATE EXTERNAL
TABLE AS COPY statement aborts with an error message. The ORC reader does not attempt to
read only some columns; either the entire file is read or the operation fails. For a complete list of
supported types, see HIVE Data Types.
Kerberos
If the ORC file is located on an HDFS cluster that uses Kerberos authentication, HP Vertica uses
the current user's principal to authenticate. It does not use the database's principal.
Query Performance
When working with external tables in ORC format, HP Vertica tries to improve performance in two
ways: by pushing query execution closer to the data so less has to be read and transmitted, and by
taking advantage of data locality in planning the query.
Predicate pushdown moves parts of the query execution closer to the data, reducing the amount of
data that must be read from disk or across the network. ORC files have three levels of indexing: file
statistics, stripe statistics, and row group indexes. Predicates are applied only to the first two levels.
Predicate pushdown is automatically applied for ORC files written with Hive version 0.14 and later.
ORC files written with earlier versions of Hive might not contain the required statistics. When
executing a query against an ORC file that lacks these statistics, HP Vertica logs an
EXTERNAL_PREDICATE_PUSHDOWN_NOT_SUPPORTED event in the QUERY_EVENTS
system table. If you are seeing performance problems with your queries, check this table
for these events.
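For example, the following query is a sketch of such a check; the table and event name come from the text above, though the exact columns you inspect may vary:

=> SELECT event_type, event_description, suggested_action
   FROM v_monitor.query_events
   WHERE event_type = 'EXTERNAL_PREDICATE_PUSHDOWN_NOT_SUPPORTED';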
In a cluster where HP Vertica nodes are co-located on HDFS nodes, the query can also take
advantage of data locality. If data is on an HDFS node where a database node is also present, and
if the query is not restricted to specific nodes using ON NODE, then the query planner uses that
database node to read that data. This allows HP Vertica to read data locally instead of making a
network call.
You can see how much ORC data is being read locally by inspecting the query plan. The label for
LoadStep(s) in the plan contains a statement of the form: "X% of ORC data matched with
co-located Vertica nodes". To increase the volume of local reads, consider adding more database
nodes. HDFS data, by its nature, can't be moved to specific nodes, but if you run more database
nodes you increase the likelihood that a database node is local to one of the copies of the data.
Examples
The following example shows how to read from all ORC files in a directory. It uses all supported
data types.
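A minimal sketch of such a statement, assuming hypothetical column names (a1 through a9) and a placeholder webhdfs path; substitute your own schema and location:

=> CREATE EXTERNAL TABLE t (a1 TINYINT, a2 SMALLINT, a3 INT, a4 BIGINT,
   a5 FLOAT, a6 BOOLEAN, a7 DATE, a8 TIMESTAMP, a9 VARCHAR(20))
AS COPY FROM 'webhdfs://somehost:port/opt/data/orc/*' ON ANY NODE ORC;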
The following example shows the error that is produced if the file you specify is not recognized as
an ORC file:
CREATE EXTERNAL TABLE t (a1 TINYINT, a2 SMALLINT, a3 INT, a4 BIGINT, a5 FLOAT)
AS COPY FROM '/data/not_an_orc_file.orc' ORC;
ERROR 0: Failed to read orc source [/data/not_an_orc_file.orc]: Not an ORC file
This approach takes advantage of the parallel nature of both HP Vertica and Hadoop. In addition, by
performing the retrieval and extraction of data directly, the HCatalog Connector reduces the impact
of the query on the Hadoop cluster.
HP Vertica Requirements
All of the nodes in your cluster must have a Java Virtual Machine (JVM) installed. See Installing the
Java Runtime on Your HP Vertica Cluster.
You must also add certain libraries distributed with Hadoop and Hive to your HP Vertica installation
directory. See Configuring HP Vertica for HCatalog.
Hadoop Requirements
Your Hadoop cluster must meet several requirements to operate correctly with the HP Vertica
Connector for HCatalog:
- It must have Hive and HCatalog installed and running. See Apache's HCatalog page for more information.
- It must have WebHCat (formerly known as Templeton) installed and running. See Apache's WebHCat page for details.
- The WebHCat server and all of the HDFS nodes that store HCatalog data must be directly accessible from all of the hosts in your HP Vertica database. Verify that any firewall separating the Hadoop cluster and the HP Vertica cluster will pass WebHCat, metastore database, and HDFS traffic.
- The data that you want to query must be in an internal or external Hive table.
- If a table you want to query uses a non-standard SerDe, you must install the SerDe's classes on your HP Vertica cluster before you can query the data. See Using Non-Standard SerDes.
Testing Connectivity
To test the connection between your database cluster and WebHCat, log into a node in your HP
Vertica cluster. Then, run the following command to execute an HCatalog query:
$ curl http://webHCatServer:port/templeton/v1/status?user.name=hcatUsername
Where:
- webHCatServer is the host name or IP address of the WebHCat server
- port is the port number assigned to the WebHCat service (usually 50111)
- hcatUsername is a valid username with access to HCatalog
Usually, you want to append ;echo to the command to add a linefeed after the curl command's
output. Otherwise, the command prompt is automatically appended to the command's output,
making it harder to read.
For example:
$ curl http://hcathost:50111/templeton/v1/status?user.name=hive; echo
If there are no errors, this command returns a status message in JSON format, similar to the
following:
{"status":"ok","version":"v1"}
This result indicates that WebHCat is running and that the HP Vertica host can connect to it and
retrieve a result. If you do not receive this result, troubleshoot your Hadoop installation and the
connectivity between your Hadoop and HP Vertica clusters. For details, see Troubleshooting
HCatalog Connector Problems.
You can also run some queries to verify that WebHCat is correctly configured to work with Hive.
The following example demonstrates listing the databases defined in Hive and the tables defined
within a database:
$ curl http://hcathost:50111/templeton/v1/ddl/database?user.name=hive; echo
{"databases":["default","production"]}
$ curl http://hcathost:50111/templeton/v1/ddl/database/default/table?user.name=hive; echo
{"tables":["messages","weblogs","tweets","transactions"],"database":"default"}
See Apache's WebHCat reference for details about querying Hive using WebHCat.
Note: Any previously installed Java VM on your hosts may interfere with a newly installed
Java runtime. See your Linux distribution's documentation for instructions on configuring which
JVM is the default. Unless absolutely required, you should uninstall any incompatible version
of Java before installing the Java 6 or Java 7 runtime.
If the java command is not in the shell search path, use the path to the Java executable in the
directory where you installed the JRE. Suppose you installed the JRE in /usr/java/default
(which is where the installation package supplied by Oracle installs the Java 1.6 JRE). In this case
the Java executable is /usr/java/default/bin/java.
You set the configuration parameter by executing the following statement as a database
superuser:
=> ALTER DATABASE mydb SET JavaBinaryForUDx = '/usr/bin/java';
Once you have set the configuration parameter, HP Vertica can find the Java executable on each
node in your cluster.
Note: Since the location of the Java executable is set by a single configuration parameter for
the entire cluster, you must ensure that the Java executable is installed in the same path on all
of the hosts in the cluster.
hcatUtil --copyJars --hadoopHiveHome="/hadoop/lib;/hive/lib;/hcatalog/dist/share"
--hadoopHiveConfPath="/hadoop;/hive;/webhcat"
--hcatLibPath=/tmp/hadoop-files
If you are using the HP Vertica for SQL on Hadoop product with co-located clusters, you can do this
in one step on a shared node. Your Hadoop, Hive, and HCatalog lib paths might be different; use
the values from your environment in the following command:
hcatUtil --copyJars --hadoopHiveHome="/hadoop/lib;/hive/lib;/hcatalog/dist/share"
--hadoopHiveConfPath="/hadoop;/hive;/webhcat"
--hcatLibPath=/opt/vertica/packages/hcat/lib
The hcatUtil options include:
-v, --verifyJars
--hadoopHiveHome="value1;value2;..."
--hcatLibPath="value1;value2;..."
Now you can create HCatalog schema parameters, which point to your existing
Hadoop/Hive/WebHCat services, as described in Defining a Schema Using the HCatalog
Connector.
To use an HA NameNode with HP Vertica, first copy /etc/hadoop/conf from the HDFS cluster to every node in
your HP Vertica cluster. You can put this directory anywhere, but it must be in the same location on
every node. (In the example below it is in /opt/hcat/hadoop_conf.)
Then uninstall the HCat library, configure the UDx to use that configuration directory, and reinstall
the library:
=> \i /opt/vertica/packages/hcat/ddl/uninstall.sql
DROP LIBRARY
=> ALTER DATABASE mydb SET JavaClassPathSuffixForUDx = '/opt/hcat/hadoop_conf';
WARNING 2693: Configuration parameter JavaClassPathSuffixForUDx has been deprecated;
setting it has no effect
=> \i /opt/vertica/packages/hcat/ddl/install.sql
CREATE LIBRARY
CREATE SOURCE FUNCTION
GRANT PRIVILEGE
CREATE PARSER FUNCTION
GRANT PRIVILEGE
When creating the schema, you must supply the host name or IP address of Hive's metastore
database (the database server that contains metadata about Hive's data, such as the schema and
table definitions).
Other parameters are optional. If you do not supply a value, HP Vertica uses default values.
After you define the schema, you can query the data in the Hive data warehouse in the same way
you query a native HP Vertica table. The following example demonstrates creating an HCatalog
schema and then querying several system tables to examine the contents of the new schema. See
Viewing Hive Schema and Table Metadata for more information about these tables.
=> CREATE HCATALOG SCHEMA hcat WITH hostname='hcathost' HCATALOG_SCHEMA='default'
-> HCATALOG_USER='hcatuser';
CREATE SCHEMA
=> -- Show list of all HCatalog schemas
=> \x
Expanded display is on.
=> SELECT * FROM v_catalog.hcatalog_schemata;
-[ RECORD 1 ]--------+-------------------------------
schema_id            | 45035996273748980
schema_name          | hcat
schema_owner_id      | 45035996273704962
schema_owner         | dbadmin
create_time          | 2013-11-04 15:09:03.504094-05
hostname             | hcathost
port                 | 9933
webservice_hostname  | hcathost
webservice_port      | 50111
hcatalog_schema_name | default
hcatalog_user_name   | hcatuser
metastore_db_name    | hivemetastoredb
=> -- List the tables in all HCatalog schemas
=> SELECT * FROM v_catalog.hcatalog_table_list;
-[ RECORD 1 ]------+------------------
table_schema_id    | 45035996273748980
table_schema       | hcat
hcatalog_schema    | default
table_name         | messages
hcatalog_user_name | hcatuser
-[ RECORD 2 ]------+------------------
table_schema_id    | 45035996273748980
table_schema       | hcat
hcatalog_schema    | default
table_name         | weblog
hcatalog_user_name | hcatuser
-[ RECORD 3 ]------+------------------
table_schema_id    | 45035996273748980
table_schema       | hcat
hcatalog_schema    | default
table_name         | tweets
hcatalog_user_name | hcatuser
 . . .
  5 | coROws3OF  | 2013-10-29 00:53:39 |
  6 | oDRP1i     | 2013-10-29 01:04:23 |
  7 | AU7a9Kp    | 2013-10-29 01:15:07 |
  8 | ZJWg185DkZ | 2013-10-29 01:25:51 |
  9 | E7ipAsYC3  | 2013-10-29 01:36:35 |
 10 | kStCv      | 2013-10-29 01:47:19 |
(10 rows)
Since the tables you access through the HCatalog Connector act like HP Vertica tables, you can
perform operations that use both Hive data and native HP Vertica data, such as a join:
=> SELECT u.FirstName, u.LastName, d.time, d.Message from UserData u
-> JOIN hcat.messages d ON u.UserID = d.UserID LIMIT 10;
 FirstName | LastName |        time         |               Message
-----------+----------+---------------------+-------------------------------------
 Whitney   | Kerr     | 2013-10-29 00:10:43 | hymenaeos cursus lorem Suspendis
 Troy      | Oneal    | 2013-10-29 00:32:11 | porta Vivamus condimentum
 Renee     | Coleman  | 2013-10-29 00:42:55 | lectus quis imperdiet
 Fay       | Moss     | 2013-10-29 00:53:39 | sit eleifend tempus a aliquam mauri
 Dominique | Cabrera  | 2013-10-29 01:15:07 | turpis vehicula tortor
 Mohammad  | Eaton    | 2013-10-29 00:21:27 | Fusce ad sem vehicula morbi
 Cade      | Barr     | 2013-10-29 01:25:51 | sapien adipiscing eget Aliquam tor
 Oprah     | Mcmillan | 2013-10-29 01:36:35 | varius Cum iaculis metus
 Astra     | Sherman  | 2013-10-29 01:58:03 | dignissim odio Pellentesque primis
 Chelsea   | Malone   | 2013-10-29 02:08:47 | pede tempor dignissim Sed luctus
(10 rows)
- HCATALOG_SCHEMATA lists all of the schemata (plural of schema) that have been defined using the HCatalog Connector. See HCATALOG_SCHEMATA in the SQL Reference Manual for detailed information.
- HCATALOG_TABLE_LIST contains an overview of all of the tables available from all schemata defined using the HCatalog Connector. This table only shows the tables that the user querying the table can access. The information in this table is retrieved using a single call to WebHCat for each schema defined using the HCatalog Connector, which means there is a little overhead when querying this table. See HCATALOG_TABLE_LIST in the SQL Reference Manual for detailed information.
- HCATALOG_COLUMNS lists metadata about all of the columns in all of the tables available through the HCatalog Connector. Similarly to HCATALOG_TABLES, querying this table results in one call to WebHCat per table, and therefore can take a while to complete. See HCATALOG_COLUMNS in the SQL Reference Manual for more information.
The following example demonstrates querying the system tables containing metadata for the tables
available through the HCatalog Connector.
=> CREATE HCATALOG SCHEMA hcat WITH hostname='hcathost'
-> HCATALOG_SCHEMA='default' HCATALOG_DB='default' HCATALOG_USER='hcatuser';
CREATE SCHEMA
=> SELECT * FROM HCATALOG_SCHEMATA;
-[ RECORD 1 ]--------+------------------------------
schema_id            | 45035996273864536
schema_name          | hcat
schema_owner_id      | 45035996273704962
schema_owner         | dbadmin
create_time          | 2013-11-05 10:19:54.70965-05
hostname             | hcathost
port                 | 9083
webservice_hostname  | hcathost
webservice_port      | 50111
hcatalog_schema_name | default
hcatalog_user_name   | hcatuser
metastore_db_name    | hivemetastoredb
=> SELECT * FROM HCATALOG_TABLE_LIST;
-[ RECORD 1 ]------+------------------
table_schema_id    | 45035996273864536
table_schema       | hcat
hcatalog_schema    | default
table_name         | hcatalogtypes
hcatalog_user_name | hcatuser
-[ RECORD 2 ]------+------------------
table_schema_id    | 45035996273864536
table_schema       | hcat
hcatalog_schema    | default
table_name         | tweets
hcatalog_user_name | hcatuser
-[ RECORD 3 ]------+------------------
table_schema_id    | 45035996273864536
table_schema       | hcat
hcatalog_schema    | default
table_name         | messages
hcatalog_user_name | hcatuser
-[ RECORD 4 ]------+------------------
table_schema_id    | 45035996273864536
table_schema       | hcat
hcatalog_schema    | default
table_name         | msgjson
hcatalog_user_name | hcatuser
=> -- Get detailed description of a specific table
=> SELECT * FROM HCATALOG_TABLES WHERE table_name = 'msgjson';
-[ RECORD 1 ]---------+------------------------------------------------------------
table_schema_id       | 45035996273864536
table_schema          | hcat
hcatalog_schema       | default
table_name            | msgjson
hcatalog_user_name    | hcatuser
min_file_size_bytes   | 13524
total_number_files    | 10
location              | hdfs://hive.example.com:8020/user/exampleuser/msgjson
last_update_time      | 2013-11-05 14:18:07.625-05
output_format         | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
last_access_time      | 2013-11-11 13:21:33.741-05
max_file_size_bytes   | 45762
is_partitioned        | f
partition_expression  |
table_owner           | hcatuser
input_format          | org.apache.hadoop.mapred.TextInputFormat
total_file_size_bytes | 453534
hcatalog_group        | supergroup
permission            | rwxr-xr-x
=> -- Get list of columns in a specific table
=> SELECT * FROM HCATALOG_COLUMNS WHERE table_name = 'hcatalogtypes'
-> ORDER BY ordinal_position;
-[ RECORD 1 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | intcol
hcatalog_data_type       | int
data_type                | int
data_type_id             | 6
data_type_length         | 8
character_maximum_length |
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 1
-[ RECORD 2 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | floatcol
hcatalog_data_type       | float
data_type                | float
data_type_id             | 7
data_type_length         | 8
character_maximum_length |
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 2
-[ RECORD 3 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | doublecol
hcatalog_data_type       | double
data_type                | float
data_type_id             | 7
data_type_length         | 8
character_maximum_length |
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 3
-[ RECORD 4 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | charcol
hcatalog_data_type       | string
data_type                | varchar(65000)
data_type_id             | 9
data_type_length         | 65000
character_maximum_length | 65000
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 4
-[ RECORD 5 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | varcharcol
hcatalog_data_type       | string
data_type                | varchar(65000)
data_type_id             | 9
data_type_length         | 65000
character_maximum_length | 65000
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 5
-[ RECORD 6 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | boolcol
hcatalog_data_type       | boolean
data_type                | boolean
data_type_id             | 5
data_type_length         | 1
character_maximum_length |
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 6
-[ RECORD 7 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | timestampcol
hcatalog_data_type       | string
data_type                | varchar(65000)
data_type_id             | 9
data_type_length         | 65000
character_maximum_length | 65000
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 7
-[ RECORD 8 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | varbincol
hcatalog_data_type       | binary
data_type                | varbinary(65000)
data_type_id             | 17
data_type_length         | 65000
character_maximum_length | 65000
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 8
-[ RECORD 9 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | bincol
hcatalog_data_type       | binary
data_type                | varbinary(65000)
data_type_id             | 17
data_type_length         | 65000
character_maximum_length | 65000
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 9
Note: You can query tables in the local schema you synched with an HCatalog schema.
Querying tables in a synched schema isn't much faster than directly querying the HCatalog
schema, because SYNC_WITH_HCATALOG_SCHEMA only duplicates the HCatalog
schema's metadata. The data in the table is still retrieved using the HCatalog Connector.
Hive Data Type       | HP Vertica Data Type
---------------------+--------------------------------------------------
TINYINT (1-byte)     | TINYINT (8-bytes)
SMALLINT (2-bytes)   | SMALLINT (8-bytes)
INT (4-bytes)        | INT (8-bytes)
BIGINT (8-bytes)     | BIGINT (8-bytes)
BOOLEAN              | BOOLEAN
FLOAT (4-bytes)      | FLOAT (8-bytes)
DOUBLE (8-bytes)     | DOUBLE PRECISION (8-bytes)
STRING (2 GB max)    | VARCHAR (65000)
BINARY (2 GB max)    | VARBINARY (65000)
LIST/ARRAY           | VARCHAR (65000) containing a JSON representation
MAP                  | VARCHAR (65000) containing a JSON representation
STRUCT               | VARCHAR (65000) containing a JSON representation
Page 88 of 135
In the example, ROW FORMAT SERDE indicates that a special SerDe is used to parse the data files.
The next row of the output shows the class name of the SerDe.
Find the installation files for the SerDe, then copy them to your HP Vertica cluster. For
example, there are several third-party JSON SerDes available from sites like Google Code and
GitHub. You may find one that matches the file installed on your Hive cluster. If so, then
download the package and copy it to your HP Vertica cluster.
Alternatively, directly copy the JAR files from a Hive server onto your HP Vertica cluster. The
location of the SerDe JAR files depends on your Hive installation. On some systems, they may be
located in /usr/lib/hive/lib.
Wherever you get the files, copy them into the /opt/vertica/packages/hcat/lib directory on
every node in your HP Vertica cluster.
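One way to distribute the files is with scp from a machine that already has them. A sketch, assuming hypothetical host names and a JAR named json-serde.jar:

$ for host in vertica01 vertica02 vertica03; do
    scp json-serde.jar $host:/opt/vertica/packages/hcat/lib/
  done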
Important: If you add a new host to your HP Vertica cluster, remember to copy every custom
SerDe JAR file to it.
Connection Errors
When you use CREATE HCATALOG SCHEMA to create a new schema, the HCatalog Connector
does not immediately attempt to connect to the WebHCat or metastore servers. Instead, when you
execute a query using the schema or HCatalog-related system tables, the connector attempts to
connect to and retrieve data from your Hadoop cluster.
The types of errors you get depend on which parameters are incorrect. Suppose you have incorrect
parameters for the metastore database, but correct parameters for WebHCat. In this case,
HCatalog-related system table queries succeed, while queries on the HCatalog schema fail. The
following example demonstrates creating an HCatalog schema with the correct default WebHCat
information. However, the port number for the metastore database is incorrect.
=> CREATE HCATALOG SCHEMA hcat2 WITH hostname='hcathost'
-> HCATALOG_SCHEMA='default' HCATALOG_USER='hive' PORT=1234;
CREATE SCHEMA
=> SELECT * FROM HCATALOG_TABLE_LIST;
-[ RECORD 1 ]------+---------------------
table_schema_id    | 45035996273864536
table_schema       | hcat2
hcatalog_schema    | default
table_name         | test
hcatalog_user_name | hive
=> SELECT * FROM hcat2.test;
ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error in User Defined
Object [VHCatSource], error code: 0
com.vertica.sdk.UdfException: Error message is [
org.apache.hcatalog.common.HCatException : 2004 : HCatOutputFormat not
initialized, setOutput has to be called. Cause : java.io.IOException:
MetaException(message:Could not connect to meta store using any of the URIs
provided. Most recent failure: org.apache.thrift.transport.TTransportException:
java.net.ConnectException:
Connection refused
at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(
HiveMetaStoreClient.java:277)
. . .
To resolve these issues, you must drop the schema and recreate it with the correct parameters. If
you still have issues, determine whether there are connectivity issues between your HP Vertica
cluster and your Hadoop cluster. Such issues can include a firewall that prevents one or more HP
Vertica hosts from contacting the WebHCat, metastore, or HDFS hosts.
You may also see this error if you are using HA NameNode, particularly with larger tables that
HDFS splits into multiple blocks. See Using the HCatalog Connector with HA NameNode for more
information about correcting this problem.
This type of error can have several causes:
- You are not using the same version of Java on your Hadoop and HP Vertica nodes. In this case, change one of them to match the other.
- You have not used hcatUtil to copy the Hadoop and Hive libraries to HP Vertica.
- You copied the libraries, but they no longer match the versions of Hive and Hadoop that you are using.
- The version of Hadoop you are using relies on a third-party library that you must copy manually.
If you did not copy the libraries, follow the instructions in Configuring HP Vertica for HCatalog.
If the Hive JARs that you copied from Hadoop are out of date, you might see an error message
like the following:
ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error in User Defined Object
[VHCatSource], error code: 0 Error message is
[ Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected ]
HINT hive metastore service is thrift://localhost:13433 (check
UDxLogs/UDxFencedProcessesJava.log in the catalog directory for more information)
This usually signals a problem with the hive-hcatalog-core JAR file. Make sure you have an
up-to-date copy of it. Remember that if you rerun hcatUtil, you also need to re-create the HCatalog schema.
You might also see a different form of this error:
ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error in User Defined Object
[VHCatSource],
error code: 0 Error message is [ javax/servlet/Filter ]
This error can be reported even if hcatUtil reports that your libraries are up to date. The
javax.servlet.Filter class is in a library that some versions of Hadoop use but that is not usually part
of the Hadoop installation directly. If you see an error mentioning this class, locate servlet-api-*.jar
on a Hadoop node and copy it to the hcat/lib directory on all database nodes. If you cannot locate it
on a Hadoop node, locate and download it from the Internet. (This case is rare.) The library version
must be 2.3 or higher.
Once you have copied the jar to the hcat/lib directory, reinstall the HCatalog connector as explained
in Configuring HP Vertica for HCatalog.
SerDe Errors
Errors can occur if you attempt to query a Hive table that uses a non-standard SerDe. If you have
not installed the SerDe JAR files on your HP Vertica cluster, you receive an error similar to the one
in the following example:
=> SELECT * FROM hcat.jsontable;
ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error in User Defined
Object [VHCatSource], error code: 0
com.vertica.sdk.UdfException: Error message is [
org.apache.hcatalog.common.HCatException : 2004 : HCatOutputFormat not
initialized, setOutput has to be called. Cause : java.io.IOException:
java.lang.RuntimeException:
MetaException(message:org.apache.hadoop.hive.serde2.SerDeException
SerDe com.cloudera.hive.serde.JSONSerDe does not exist) ] HINT If error
message is not descriptive or local, may be we cannot read metadata from hive
metastore service thrift://hcathost:9083 or HDFS namenode (check
UDxLogs/UDxFencedProcessesJava.log in the catalog directory for more information)
at com.vertica.hcatalogudl.HCatalogSplitsNoOpSourceFactory
.plan(HCatalogSplitsNoOpSourceFactory.java:98)
at com.vertica.udxfence.UDxExecContext.planUDSource(UDxExecContext.java:898)
. . .
In the error message, you can see that the root cause is a missing SerDe class
(com.cloudera.hive.serde.JSONSerDe). To resolve this issue, install the SerDe class on your
HP Vertica cluster. See Using Non-Standard SerDes for more information.
This error may occur intermittently if just one or a few hosts in your cluster do not have the SerDe
class.
You can set these values globally using configuration parameters. You can also set them for
specific HCatalog schemas in the CREATE HCATALOG SCHEMA statement. The schema-specific
settings override the settings in the configuration parameters.
The HCatConnectionTimeout configuration parameter and the CREATE HCATALOG SCHEMA
statement's HCATALOG_CONNECTION_TIMEOUT parameter control how many seconds the
HCatalog Connector waits for a connection to the WebHCat server. A value of 0 (the default setting
for the configuration parameter) means to wait indefinitely. If the WebHCat server does not respond
by the time this timeout elapses, the HCatalog Connector breaks the connection and cancels the
query. If you find that some queries on an HCatalog schema pause excessively, try setting this
parameter to a timeout value, so the query does not hang indefinitely.
The HCatSlowTransferTime configuration parameter and the
CREATE HCATALOG SCHEMA statement's HCATALOG_SLOW_TRANSFER_TIME parameter
specify how long the HCatalog Connector waits for data after making a successful connection to the
WebHCat server. After the specified time has elapsed, the HCatalog Connector determines
whether the data transfer rate from the WebHCat server is at least the value set in the
HCatSlowTransferLimit configuration parameter (or by the CREATE HCATALOG SCHEMA
statement's HCATALOG_SLOW_TRANSFER_LIMIT parameter). If it is not, then the HCatalog
Connector terminates the connection and cancels the query.
You can set these parameters to cancel queries that run very slowly but do eventually complete.
However, query delays are usually caused by a slow connection rather than a problem establishing
the connection. Therefore, try adjusting the slow transfer rate settings first. If you find the cause of
the issue is connections that never complete, you can alternately adjust the Linux TCP socket
timeouts to a suitable value instead of relying solely on the HCatConnectionTimeout parameter.
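For example, a sketch that sets cluster-wide values with ALTER DATABASE, assuming a database named mydb; the values are illustrative, not recommendations:

=> ALTER DATABASE mydb SET HCatConnectionTimeout = 30;
=> ALTER DATABASE mydb SET HCatSlowTransferTime = 30;
=> ALTER DATABASE mydb SET HCatSlowTransferLimit = 32768;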
Before you create an HDFS storage location, verify that:
- All of the nodes in your HP Vertica cluster can connect to all of the nodes in your Hadoop cluster. Any firewall between the two clusters must allow connections on the ports used by HDFS. See Testing Your Hadoop WebHDFS Configuration for a procedure to test the connectivity between your HP Vertica and Hadoop clusters.
- You have a Hadoop user whose username matches the name of the HP Vertica database administrator (usually named dbadmin). This Hadoop user must have read and write access to the HDFS directory where you want HP Vertica to store its data.
- Your HDFS has enough storage available for HP Vertica data. See HDFS Space Requirements below for details.
- The data you store in an HDFS-backed storage location does not expand your database's size beyond any data allowance in your HP Vertica license. HP Vertica counts data stored in an HDFS-backed storage location as part of any data allowance set by your license. See Managing Licenses in the Administrator's Guide for more information.
To support backup of HDFS storage locations, your Hadoop cluster must:
- Have HDFS 2.0 or later installed. The vbr.py backup utility uses the snapshot feature introduced in HDFS 2.0.
- Have snapshotting enabled for the directories to be used for backups. The easiest way to do this is to give the database administrator's account superuser privileges in Hadoop, so that snapshotting can be set automatically. Alternatively, use Hadoop to enable snapshotting for each directory before using it for backups.
- Have enough Hadoop components and libraries installed to run the Hadoop distcp command as the HP Vertica database administrator user (usually dbadmin).
Caution: After you have created an HDFS storage location, full database backups will fail with
the error message:
ERROR 5127: Unable to create snapshot No such file /usr/bin/hadoop:
check the HadoopHome configuration parameter
This error is caused by the backup script not being able to back up the HDFS storage
locations. You must configure HP Vertica and Hadoop to enable the backup script to back
these locations. After you configure HP Vertica and Hadoop, you can once again perform full
database backups.
See Backing Up HDFS Storage Locations for details on configuring your HP Vertica and Hadoop
clusters to enable HDFS storage location backup.
Never allow another program that has access to HDFS to write to the ROS files. Any outside
modification of these files can lead to data corruption and loss.
Use the HP Vertica Connector for Hadoop MapReduce if you need your MapReduce job to access
HP Vertica data. Other applications must use the HP Vertica client libraries to access HP Vertica
data.
The storage location stores and reads only ROS containers. It cannot read data stored in native
formats in HDFS. If you want HP Vertica to read data from HDFS, use the HP Vertica Connector
for HDFS. If the data you want to access is available in a Hive database, you can use the HP
Vertica Connector for HCatalog.
- If your HDFS cluster is unsecured, create a Hadoop user whose username matches the user name of the HP Vertica database administrator account. For example, suppose your database administrator account has the default username dbadmin. You must create a Hadoop user account named dbadmin and give it full read and write access to the directory on HDFS where HP Vertica will store files.
- If your HDFS cluster uses Kerberos authentication, create a Kerberos principal for HP Vertica and give it read and write access to the HDFS directory that will be used for the storage location. See Configuring Kerberos.
Consult the documentation for your Hadoop distribution to learn how to create a user and grant the
user read and write permissions for a directory in HDFS.
Use the CREATE LOCATION statement to create an HDFS storage location. To do so, you must
supply the WebHDFS URI for the HDFS directory where you want HP Vertica to store the
location's data as the path argument. This URI is the same as a standard HDFS URL, except it
uses the webhdfs:// protocol and its path does not start with /webhdfs/v1/.
The following example creates a storage location that:
- Is located on the Hadoop cluster whose name node's host name is hadoop.
- Is labeled coldstorage.
The example also demonstrates querying the STORAGE_LOCATIONS system table to verify that
the storage location was created.
=> CREATE LOCATION 'webhdfs://hadoop:50070/user/dbadmin' ALL NODES SHARED
USAGE 'data' LABEL 'coldstorage';
CREATE LOCATION
=> SELECT node_name,location_path,location_label FROM STORAGE_LOCATIONS;
    node_name     |                    location_path                     | location_label
------------------+------------------------------------------------------+----------------
 v_vmart_node0001 | /home/dbadmin/VMart/v_vmart_node0001_data            |
 v_vmart_node0001 | webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001 | coldstorage
 v_vmart_node0002 | /home/dbadmin/VMart/v_vmart_node0002_data            |
 v_vmart_node0002 | webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002 | coldstorage
 v_vmart_node0003 | /home/dbadmin/VMart/v_vmart_node0003_data            |
 v_vmart_node0003 | webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003 | coldstorage
(6 rows)
Each node in the cluster has created its own directory under the dbadmin directory in HDFS. These
individual directories prevent the nodes from interfering with each other's files in the shared
location.
Next, set the storage policy for your database to use this location:
=> SELECT set_object_storage_policy('DBNAME','HDFS');
This causes all data to be written to the HDFS storage location instead of the local disk.
Any active standby nodes in your cluster automatically create their own instances of an HDFS-based
storage location when you create it. When a standby node takes over for a down node, it uses its
own instance of the location to store data for objects using the HDFS-based storage policy. Treat
standby nodes added after you create the storage location as any other new node: you must
manually define the HDFS storage location, as in the sketch below.
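A sketch of such a definition, assuming a hypothetical new node named v_vmart_node0004; adjust the URL and node name for your cluster:

=> CREATE LOCATION 'webhdfs://hadoop:50070/user/dbadmin' NODE 'v_vmart_node0004'
   SHARED USAGE 'data' LABEL 'coldstorage';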
This table's data is moved to the HDFS storage location with the next merge-out. Alternatively, you
can have HP Vertica move the data immediately by using the enforce_storage_move parameter.
You can query the STORAGE_CONTAINERS system table and examine the location_label
column to verify that HP Vertica has moved the data:
=> SELECT node_name, projection_name, location_label, total_row_count
     FROM V_MONITOR.STORAGE_CONTAINERS
     WHERE projection_name ILIKE 'messages%';
    node_name     | projection_name | location_label | total_row_count
------------------+-----------------+----------------+-----------------
 v_vmart_node0001 | messages_b0     | coldstorage    |          366057
 v_vmart_node0001 | messages_b1     | coldstorage    |          366511
 v_vmart_node0002 | messages_b0     | coldstorage    |          367432
 v_vmart_node0002 | messages_b1     | coldstorage    |          366057
 v_vmart_node0003 | messages_b0     | coldstorage    |          366511
 v_vmart_node0003 | messages_b1     | coldstorage    |          367432
(6 rows)
See Creating Storage Policies in the Administrator's Guide for more information about assigning
storage policies to objects.
=> SELECT partition_key, projection_name, node_name, location_label FROM partitions
     ORDER BY partition_key;
 partition_key | projection_name |    node_name     | location_label
---------------+-----------------+------------------+----------------
 201309        | messages_b1     | v_vmart_node0001 |
 201309        | messages_b0     | v_vmart_node0003 |
 201309        | messages_b1     | v_vmart_node0002 |
 201309        | messages_b1     | v_vmart_node0003 |
 201309        | messages_b0     | v_vmart_node0001 |
 201309        | messages_b0     | v_vmart_node0002 |
 201310        | messages_b0     | v_vmart_node0002 |
 201310        | messages_b1     | v_vmart_node0003 |
 201310        | messages_b0     | v_vmart_node0001 |
 . . .
 201405        | messages_b0     | v_vmart_node0002 |
 201405        | messages_b1     | v_vmart_node0003 |
 201405        | messages_b1     | v_vmart_node0001 |
 201405        | messages_b0     | v_vmart_node0001 |
(54 rows)
Next, suppose you find that most queries on this table access only the latest month or two of data.
You may decide to move the older data to cold storage in an HDFS-based storage location. After
you move the data, it is still available for queries, but with lower query performance.
To move partitions to the HDFS storage location, supply the lowest and highest partition key values
to be moved in the SET_OBJECT_STORAGE_POLICY function call. The following example
shows how to move data between two dates to an HDFS-based storage location.
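A sketch of such a call, assuming the partition keys 201309 through 201403 (the range is illustrative; it mirrors the labeled output below):

=> SELECT SET_OBJECT_STORAGE_POLICY('messages', 'coldstorage', '201309', '201403'
   USING PARAMETERS ENFORCE_STORAGE_MOVE = 'true');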
After the statement finishes, the range of partitions appears in the HDFS storage location
labeled coldstorage. This location name now displays in the PARTITIONS system table's
location_label column.
=> SELECT partition_key, projection_name, node_name, location_label
     FROM partitions ORDER BY partition_key;
 partition_key | projection_name |    node_name     | location_label
---------------+-----------------+------------------+----------------
 201309        | messages_b0     | v_vmart_node0003 | coldstorage
 201309        | messages_b1     | v_vmart_node0001 | coldstorage
 201309        | messages_b1     | v_vmart_node0002 | coldstorage
 201309        | messages_b0     | v_vmart_node0001 | coldstorage
 . . .
 201403        | messages_b0     | v_vmart_node0002 | coldstorage
 201404        | messages_b0     | v_vmart_node0001 |
 201404        | messages_b0     | v_vmart_node0002 |
 201404        | messages_b1     | v_vmart_node0001 |
 201404        | messages_b1     | v_vmart_node0002 |
 201404        | messages_b0     | v_vmart_node0003 |
 201404        | messages_b1     | v_vmart_node0003 |
 201405        | messages_b0     | v_vmart_node0001 |
 201405        | messages_b1     | v_vmart_node0002 |
 201405        | messages_b0     | v_vmart_node0002 |
 201405        | messages_b0     | v_vmart_node0003 |
 201405        | messages_b1     | v_vmart_node0001 |
 201405        | messages_b1     | v_vmart_node0003 |
(54 rows)
After your initial data move, you can move additional data to the HDFS storage location
periodically. You move individual partitions or a range of partitions from the "hot" storage location
to the "cold" storage location using the same method:
=> SELECT SET_OBJECT_STORAGE_POLICY('messages', 'coldstorage', '201404', '201404'
USING PARAMETERS ENFORCE_STORAGE_MOVE = 'true');
 SET_OBJECT_STORAGE_POLICY
----------------------------
 Object storage policy set.
(1 row)
=> SELECT projection_name, node_name, location_label
FROM PARTITIONS WHERE PARTITION_KEY = '201404';
 projection_name |    node_name     | location_label
-----------------+------------------+----------------
 messages_b0     | v_vmart_node0002 | coldstorage
 messages_b0     | v_vmart_node0003 | coldstorage
 messages_b1     | v_vmart_node0003 | coldstorage
 messages_b0     | v_vmart_node0001 | coldstorage
 messages_b1     | v_vmart_node0002 | coldstorage
 messages_b1     | v_vmart_node0001 | coldstorage
(6 rows)
1. Create a new table whose schema matches that of the existing partitioned table.
2. Set the storage policy of the new table to use the HDFS-based storage location.
3. Use the MOVE_PARTITIONS_TO_TABLE function to move a range of partitions from the hot
table to the cold table.
The following example demonstrates these steps. You first create a table named cold_messages.
You then assign it the HDFS-based storage location named coldstorage, and, finally, move a range
of partitions.
=> CREATE TABLE cold_messages LIKE messages INCLUDING PROJECTIONS;
=> SELECT SET_OBJECT_STORAGE_POLICY('cold_messages', 'coldstorage');
=> SELECT MOVE_PARTITIONS_TO_TABLE('messages','201309','201403','cold_messages');
Note: The partitions moved using this method do not immediately migrate to the storage
location on HDFS. Instead, the Tuple Mover eventually moves them to the storage location.
There are several considerations for backing up HDFS storage locations in your database:
- The HDFS storage location backup feature relies on the snapshotting feature introduced in HDFS 2.0. You cannot back up an HDFS storage location stored on an earlier version of HDFS.
- HDFS storage locations do not support object-level backups. You must perform a full database backup in order to back up the data in your HDFS storage locations.
- Data in an HDFS storage location is backed up to HDFS. This backup guards against accidental deletion or corruption of data. It does not prevent data loss in the case of a catastrophic failure of the entire Hadoop cluster. To prevent data loss, you must have a backup and disaster recovery plan for your Hadoop cluster.
- Data stored on the Linux native filesystem is still backed up to the location you specify in the backup configuration file. It and the data in HDFS storage locations are handled separately by the vbr.py backup script.
- You must configure your HP Vertica cluster in order to restore database backups containing an HDFS storage location. See Configuring HP Vertica to Back Up HDFS Storage Locations for the configuration steps you must take.
- The HDFS directory for the storage location must have snapshotting enabled. You can either directly configure this yourself or enable the database administrator's Hadoop account to do it for you automatically. See Configuring Hadoop to Enable Backup of HDFS Storage for more information.
The topics in this section explain the configuration steps you must take to enable the backup of
HDFS storage locations.
Before you begin, determine the distribution and version of Hadoop running on the Hadoop cluster
containing your HDFS storage location.
Note: Installing the Hadoop packages necessary to run distcp does not turn your HP Vertica
database into a Hadoop cluster. This process installs just enough of the Hadoop support files
on your cluster to run the distcp command. There is no additional overhead placed on the HP
Vertica cluster, aside from a small amount of additional disk space consumed by the Hadoop
support files.
Configuration Overview
The steps for configuring your HP Vertica cluster to restore backups for HDFS storage locations are:
1. If necessary, install and configure a Java runtime on the hosts in the HP Vertica cluster.
2. Find the location of your Hadoop distribution's package repository.
3. Add the Hadoop distribution's package repository to the Linux package manager on all hosts in
your cluster.
4. Install the necessary Hadoop packages on your HP Vertica hosts.
5. Set two configuration parameters in your HP Vertica database related to Java and Hadoop.
6. If your HDFS storage location uses Kerberos, set additional configuration parameters to allow
HP Vertica user credentials to be proxied.
7. Confirm that the Hadoop distcp command runs on your HP Vertica hosts.
The following sections describe these steps in greater detail.
HP Vertica uses a Java Virtual Machine (JVM) to:
- Execute User-Defined Extensions developed in Java. See Developing User Defined Functions in Java for more information.
- Access Hadoop data using the HCatalog Connector. See Using the HCatalog Connector for more information.
If your HP Vertica database already has a JVM installed, verify that your Hadoop
distribution supports it. See your Hadoop distribution's documentation to determine which JVMs it
supports.
If the JVM installed on your HP Vertica cluster is not supported by your Hadoop distribution, you
must uninstall it. Then install a JVM that is supported by both HP Vertica and your
Hadoop distribution. See HP Vertica SDKs in Supported Platforms for a list of the JVMs compatible
with HP Vertica.
If your HP Vertica cluster does not have a JVM (or its existing JVM is incompatible with your
Hadoop distribution), follow the instructions in Installing the Java Runtime on Your HP Vertica
Cluster.
The "Steps to Install CDH 5 Manually" section of the Cloudera Version 5.1.0 topic Installing
CDH5.
Each Hadoop distribution maintains separate repositories for each of the major Linux package
management systems. Find the specific repository for the Linux distribution running on your HP
Vertica cluster. Be sure that the package repository that you select matches version of Hadoop
distribution installed on your Hadoop cluster.
Adding the repository generally involves:
- Adding the repository configuration file to the package management system's configuration directory.
- For Debian-based Linux distributions, adding the Hadoop repository encryption key to the root account keyring.
- Updating the package management system's index to have it discover new packages.
The following example demonstrates adding the Hortonworks 2.1 package repository to an Ubuntu
12.04 host. The steps in this example are explained in the Hortonworks documentation.
$ wget http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/hdp.list \
-O /etc/apt/sources.list.d/hdp.list
--2014-08-20 11:06:00-- http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/hdp.list
Connecting to 16.113.84.10:8080... connected.
Proxy request sent, awaiting response... 200 OK
Length: 161 [binary/octet-stream]
Saving to: `/etc/apt/sources.list.d/hdp.list'

100%[======================================>] 161         --.-K/s   in 0s

. . .
Reading package lists... Done
You must add the Hadoop repository to all hosts in your HP Vertica cluster.
Install the following packages on all hosts in your database cluster:
- hadoop
- hadoop-hdfs
- hadoop-client
The names of the packages are usually the same across all Hadoop and Linux distributions. These
packages often have additional dependencies. Always accept any additional packages that the
Linux package manager asks to install.
To install these packages, use the package manager command for your Linux distribution (for
example, apt-get on Debian-based systems or yum on Red Hat-based systems), as in the sketch
below.
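A sketch for an Ubuntu host; verify the package names against your distribution's repository:

$ sudo apt-get update
$ sudo apt-get install hadoop hadoop-hdfs hadoop-client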
Set the following configuration parameters:
- JavaBinaryForUDx is the path to the Java executable. You may have already set this value to use Java UDxs or the HCatalog Connector. You can find the path for the default Java executable from the Bash command shell using the command:
which java
- HadoopHome is the path where Hadoop is installed on the HP Vertica hosts. This is the directory that contains bin/hadoop (the bin directory containing the Hadoop executable file). The default value for this parameter is /usr. The default value is correct if your Hadoop executable is located at /usr/bin/hadoop.
The following example demonstrates setting and then reviewing the values of these parameters.
=> ALTER DATABASE mydb SET JavaBinaryForUDx = '/usr/bin/java';
=> SELECT get_config_parameter('JavaBinaryForUDx');
 get_config_parameter
----------------------
 /usr/bin/java
(1 row)
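HadoopHome is set the same way; a sketch, assuming the default /usr installation location:

=> ALTER DATABASE mydb SET HadoopHome = '/usr';
=> SELECT get_config_parameter('HadoopHome');
 get_config_parameter
----------------------
 /usr
(1 row)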
You can also set the following parameters:
- HadoopFSReplication is the number of replicas HDFS makes. By default, the Hadoop client chooses this; HP Vertica uses the same value for all nodes. We recommend against changing this value unless directed to do so.
- HadoopFSBlockSizeBytes is the block size to write to HDFS; larger files are divided into blocks of this size. The default is 64 MB.
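For example, a sketch that raises the block size to 128 MB; the value is illustrative:

=> ALTER DATABASE mydb SET HadoopFSBlockSizeBytes = 134217728;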
Parameter                                                     | Value
--------------------------------------------------------------+-------
yarn.resourcemanager.proxy-user-privileges.enabled            | true
yarn.resourcemanager.proxyusers.*.groups                      |
yarn.resourcemanager.proxyusers.*.hosts                       |
yarn.resourcemanager.proxyusers.*.users                       |
yarn.timeline-service.http-authentication.proxyusers.*.groups |
yarn.timeline-service.http-authentication.proxyusers.*.hosts  |
yarn.timeline-service.http-authentication.proxyusers.*.users  |
No changes are needed on HDFS nodes that are not also HP Vertica nodes.
(The distcp usage output lists supported options, including -p <arg>, -sizelimit <arg>,
-skipcrccheck, -strategy <arg>, -tmp <arg>, and -update.)
4. Repeat these steps on the other hosts in your database to ensure all of the hosts can run
distcp.
Troubleshooting
If you cannot run the distcp command, try the following steps:
- If Bash cannot find the hadoop command, you may need to manually add Hadoop's bin directory to the system search path. An alternative is to create a symbolic link in an existing directory in the search path (such as /usr/bin) to the hadoop binary, as shown in the sketch after this list.
- Ensure the version of Java installed on your HP Vertica cluster is compatible with your Hadoop distribution.
- Review the Linux package installation tool's logs for errors. In some cases, packages may not be fully installed, or may not have been downloaded due to network issues.
- Ensure that the database administrator account has permission to execute the hadoop command. You may need to add the account to a specific group in order to allow it to run the necessary commands.
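A sketch of the symbolic-link approach, assuming the hadoop binary is installed at /usr/lib/hadoop/bin/hadoop; verify the actual path on your hosts:

$ sudo ln -s /usr/lib/hadoop/bin/hadoop /usr/bin/hadoop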
If your cluster uses Kerberos, the relevant identity is the principal stored in the HP Vertica keytab
file, usually vertica. The instructions below use the term "database account" to refer to this user.
We recommend that you make the database administrator or principal a Hadoop superuser. If you
are not able to do so, you must enable snapshotting on the directory before configuring it for use by
HP Vertica.
The steps you need to take to make the HP Vertica database administrator account a superuser
depend on the distribution of Hadoop you are using. Consult your Hadoop distribution's
documentation for details. Instructions for two distributions are provided here.
5. Verify that the database account is now a member of supergroup using the groups command.
6. Repeat steps 1 through 5 for any other NameNodes in your Hadoop cluster.
The following example demonstrates following these steps to grant the database administrator
superuser status.
# adduser dbadmin
# groupadd supergroup
# usermod -a -G supergroup dbadmin
# groups dbadmin
dbadmin : dbadmin supergroup
Consult the documentation for the Linux distribution installed on your Hadoop cluster for more
information on managing users and groups.
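If you cannot grant superuser status, enable snapshotting on each backup directory manually. A sketch, using the standard HDFS administration tool and assuming a location directory of /user/dbadmin/v_vmart_node0001:

$ hdfs dfsadmin -allowSnapshot /user/dbadmin/v_vmart_node0001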
Issue this command for each directory on each node. Remember to do this each time you add a
new node to your HDFS cluster.
Nested snapshottable directories are not allowed, so you cannot enable snapshotting for a parent
directory to automatically enable it for child directories. You must enable it for each individual
directory.
- Drop all of the objects (tables or schemata) that store data in the location. This is the simplest option. However, you can only use this method if you no longer need the data stored in the HDFS storage location.
- Change the storage policies of objects stored on HDFS to another storage location. When you alter the storage policy, you force all of the data in the HDFS location to move to the new location. This option requires that you have an alternate storage location available.
- Clear the storage policies of all objects that store data on the storage location. You then move the location's data through a process of retiring it.
The following sections explain the last two options in greater detail.
 . . .
 v_vmart_node0003 | test_b1         | ssd            |          334136
(12 rows)
Once you have moved all of the data in the storage location, you are ready to proceed to the next
step of removing the storage location.
webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002 restored.
(1 row)
=> SELECT restore_location('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003',
'v_vmart_node0003');
                        restore_location
----------------------------------------------------------------
 webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003 restored.
(1 row)
The HP Vertica backup script creates snapshots of HDFS storage locations as part of the backup
process. See Backing Up HDFS Storage Locations for more information. If you made backups of
your HDFS storage location, you must delete the snapshots before removing the directories.
HDFS stores snapshots in a subdirectory named .snapshot. You can list the snapshots in the
directory using the standard HDFS ls command. The following example demonstrates listing the
snapshots defined for node0001.
$ hdfs dfs -ls /user/dbadmin/v_vmart_node0001/.snapshot
Found 1 items
drwxrwx---   dbadmin supergroup          0 2014-09-02 10:13 /user/dbadmin/v_vmart_node0001/.snapshot/s20140902-101358.629
The following example demonstrates the command to delete the snapshot shown in the previous
example:
$ hdfs dfs -deleteSnapshot /user/dbadmin/v_vmart_node0001 s20140902-101358.629
You must delete each snapshot from the directory for each host in the cluster. Once you have
deleted the snapshots, you can delete the directories in the storage location.
Important: Each snapshot's name is based on a timestamp down to the millisecond. Nodes
independently create their own snapshot. They do not synchronize snapshot creation, so their
snapshot names differ. You must list each node's snapshot directory to learn the names of the
snapshots it contains.
See Apache's HDFS Snapshot documentation for more information about managing and removing
snapshots.
- Use an HDFS file manager to delete directories. See your Hadoop distribution's documentation to determine if it provides a file manager.
- Log into the Hadoop NameNode using the database administrator's account and use HDFS's rmr command to delete the directories. See Apache's File System Shell Guide for more information.
The following example uses the HDFS rmr command from the Linux command line to delete the
directories left behind in the HDFS storage location directory /user/dbadmin. It uses the
-skipTrash flag to force the immediate deletion of the files.
$ hdfs dfs -ls /user/dbadmin
Found 3 items
drwxrwx---   dbadmin supergroup   /user/dbadmin/v_vmart_node0001
drwxrwx---   dbadmin supergroup   /user/dbadmin/v_vmart_node0002
drwxrwx---   dbadmin supergroup   /user/dbadmin/v_vmart_node0003
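A sketch of the deletion itself; double-check the path before running, because -skipTrash makes the removal immediate and unrecoverable:

$ hdfs dfs -rmr -skipTrash /user/dbadmin/v_vmart_node0001 \
    /user/dbadmin/v_vmart_node0002 /user/dbadmin/v_vmart_node0003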
Running the hdfs dfsadmin -report command prints the usage for the entire HDFS storage, followed
by details for each node in the Hadoop cluster. The following example shows the beginning of the
output from this command:
After loading a simple million-row table into a table stored in an HDFS storage location, the report
shows greater disk usage:
Configured Capacity: 51495516981 (47.96 GB)
Present Capacity: 32085299338 (29.88 GB)
DFS Remaining: 31373565952 (29.22 GB)
DFS Used: 711733386 (678.76 MB)
DFS Used%: 2.22%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
. . .
Running the HDFS report on Hadoop now shows less disk space in use.
Caution: Reducing the number of copies of data stored by HDFS increases the risk of data
loss. It can also degrade the performance of HDFS by reducing the number of nodes that can
serve a given file. That slowdown, in turn, can affect the performance of HP Vertica queries that
involve data stored in an HDFS storage location.
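If you still choose to reduce replication for the data HP Vertica writes to HDFS, the HadoopFSReplication configuration parameter controls how many copies HDFS keeps of that data. A minimal sketch, assuming a database named exampledb and an illustrative factor of 2, using the same ALTER DATABASE form shown later in this document:
=> ALTER DATABASE exampledb SET HadoopFSReplication = 2;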
When restoring an HDFS storage location that uses Kerberos, you might see errors reporting that
configuration files cannot be found. Failures of this kind mean that HP Vertica could not find the
required configuration files in the HadoopConfDir directory. Usually this is because you have set
the parameter but not copied the files from an HDFS node to your HP Vertica node. See "Additional
Requirements for Kerberos" in Configuring Hadoop and HP Vertica to Enable Backup of HDFS Storage.
Using Kerberos with Hadoop
HP Vertica authenticates with a Kerberos-secured Hadoop cluster in two ways:
User Authentication: On behalf of the user, by passing along the user's existing Kerberos
credentials, as occurs with the HDFS Connector and the HCatalog Connector.
HP Vertica Authentication: On behalf of system processes such as the Tuple Mover, by using a
special Kerberos credential (principal) stored in a keytab file.
User Authentication
To use HP Vertica with Kerberos and Hadoop, the client user first authenticates with the Kerberos
server (Key Distribution Center, or KDC) being used by the Hadoop cluster. A user might run kinit or
sign in to Active Directory, for example. A user who authenticates to a Kerberos server receives a
Kerberos ticket. At the beginning of a client session, HP Vertica automatically retrieves this
ticket. The database then uses this ticket to get a Hadoop token, which Hadoop uses to grant
access. HP Vertica uses this token to access HDFS, such as when executing a query on behalf of
the user. When the token expires, the database automatically renews it, also renewing the
Kerberos ticket if necessary.
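For example, a user on a Linux client might obtain a ticket by running kinit before starting a database session; the principal shown here is a placeholder:
$ kinit exampleuser@EXAMPLE.COM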
The following figure shows how the user, HP Vertica, Hadoop, and Kerberos interact in user
authentication:
When using the HDFS Connector or the HCatalog Connector, or when reading an ORC file stored
in HDFS, HP Vertica uses the client identity as the preceding figure shows.
HP Vertica Authentication
Automatic processes, such as the Tuple Mover, do not log in the way users do. Instead, HP Vertica
uses a special identity (principal) stored in a keytab file on every database node. (This approach is
also used for HP Vertica clusters that use Kerberos but do not use Hadoop.) After you configure the
keytab file, HP Vertica uses the principal residing there to automatically obtain and maintain a Kerberos
ticket, much as in the client scenario. In this case, the client does not interact with Kerberos.
The following figure shows the interactions required for HP Vertica authentication:
Each HP Vertica node uses its own principal; it is common to incorporate the name of the node into
the principal name. You can either create one keytab per node, containing only that node's principal,
or you can create a single keytab containing all the principals and distribute the file to all nodes.
Either way, the node uses its principal to get a Kerberos ticket and then uses that ticket to get a
Hadoop token. For simplicity, the preceding figure shows the full set of interactions for only one
database node.
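The following is a minimal sketch of creating a per-node principal and adding it to a keytab with the MIT Kerberos administration tools; the principal name, realm, and keytab path are assumptions, not values this document prescribes:
$ kadmin -q "addprinc -randkey vertica/node0001.example.com@EXAMPLE.COM"
$ kadmin -q "ktadd -k /etc/krb5.keytab vertica/node0001.example.com@EXAMPLE.COM"
Repeat for each node's principal, or run ktadd once per principal against a single keytab if you prefer to distribute one file to all nodes.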
When creating HDFS storage locations, HP Vertica uses the principal in the keytab file, not the
principal of the user issuing the CREATE LOCATION statement.
See Also
For specific configuration instructions, see Configuring Kerberos.
Configuring Kerberos
HP Vertica can connect with Hadoop in several ways, and how you manage Kerberos
authentication varies by connection type. This documentation assumes that you are using Kerberos
for both your HDFS and HP Vertica clusters.
Place the keytab file(s) in the same location on each database node and set its location in
KerberosKeytabFile (see Specify the Location of the Keytab File).
Set KerberosServiceName to the name of the principal (see Inform HP Vertica About the
Kerberos Principal). Both settings are sketched below.
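A hedged sketch of setting these two parameters; the database name, keytab path, and service name are placeholders, and the ALTER DATABASE form mirrors the one used for HadoopFSTokenRefreshFrequency later in this document:
=> ALTER DATABASE exampledb SET KerberosKeytabFile = '/etc/krb5.keytab';
=> ALTER DATABASE exampledb SET KerberosServiceName = 'vertica';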
HCatalog Connector
You use the HCatalog Connector to query data in Hive. Queries are executed on behalf of HP
Vertica users. If the current user has a Kerberos key, then HP Vertica passes it to the HCatalog
Connector automatically. Verify that all users who need access to Hive have been granted access
to HDFS.
In addition, in your Hadoop configuration files (core-site.xml in most distributions), make sure that
you enable all Hadoop components to impersonate the HP Vertica user. The easiest way to do this
is to set the proxyuser property using wildcards for all users on all hosts and in all groups, as
sketched below. Consult your Hadoop documentation for instructions. Make sure you do this
before running hcatUtil (see Configuring HP Vertica for HCatalog).
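For example, assuming the database runs as a Hadoop-visible user named vertica (the user name is an assumption), the core-site.xml entries might look like this:
<property>
  <name>hadoop.proxyuser.vertica.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.vertica.groups</name>
  <value>*</value>
</property>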
HDFS Connector
The HDFS Connector loads data from HDFS into HP Vertica on behalf of the user, using a User
Defined Source. If the user performing the data load has a Kerberos key, then the UDS uses it to
access HDFS. Verify that all users who use this connector have been granted access to HDFS.
HDFS Storage Locations
Configure user impersonation to be able to restore from backups, following the instructions in
"Setting Kerberos Parameters" in Configuring HP Vertica to Restore HDFS Storage Locations.
Because the keytab file supplies the principal used to create the location, you must have it in place
before creating the storage location. After you deploy keytab files to all database nodes, use the
CREATE LOCATION statement to create the storage location as usual.
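A sketch of that final step, reusing the URL pattern from the examples earlier in this document; the usage type and label are illustrative, and the exact CREATE LOCATION options depend on your HP Vertica version:
=> CREATE LOCATION 'webhdfs://hadoop:50070/user/dbadmin' ALL NODES USAGE 'data' LABEL 'coldstorage';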
Token Expiration
HP Vertica attempts to automatically refresh Hadoop tokens before they expire, but you can also
set a minimum refresh frequency if you prefer. The HadoopFSTokenRefreshFrequency
configuration parameter specifies the frequency in seconds:
=> ALTER DATABASE exampledb SET HadoopFSTokenRefreshFrequency = '86400';
If the current age of the token is greater than the value specified in this parameter, HP Vertica
refreshes the token before accessing data stored in HDFS.
See Also