
What happens if we modify/add more data while a Hive script is already running? Will there be any data consistency issues?

Answer: Technically there won't be any issue in your result. New data becomes available only after the MR job for that insert is done. Until it completes, you will not be able to query it.

How can I automate the process of verifying that my query executed successfully while
running hive -e "query"?
This assumes you are running the query from a terminal (a shell or a shell script).
In bash:
var=`hive -e "query"`
echo $?
$? holds the exit status of the last executed command (here, the hive query). It is nonzero on failure (syntax errors, etc.) and zero on success, so you can use it in conditional checks to decide what to do next.
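
The exit-status check can be wrapped in a small helper. This is a minimal sketch, not part of the original answer: the function name run_checked and its messages are hypothetical, and it relies only on the command returning zero on success, as described above.

```shell
# Hypothetical helper: run any command, capture its output, and report
# success or failure based on the exit status ($?).
run_checked() {
    output=$("$@")      # run the command given as arguments, capture stdout
    status=$?           # exit status of the command substitution above
    if [ "$status" -eq 0 ]; then
        echo "OK"
    else
        echo "FAILED with status $status" >&2
    fi
    return "$status"
}

# Example (assumes hive is on the PATH):
#   run_checked hive -e "select count(*) from pokes" || exit 1
```

Because the helper returns the original status, it can be chained with || or used directly in an if statement.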

How can I load data on HDFS into Hive in a way that copies the files into the Hive
warehouse directory instead of moving?

From the question, it seems the data is already present in HDFS, so instead of loading the data you can create an EXTERNAL table and specify its location.
create external table table_name (
id int,
name string
)
location '/my/location/hdfs';

The answer can be found in the Hive getting started guide


https://cwiki.apache.org/conflue...
Start by creating a table without the External keyword:
hive> CREATE TABLE pokes (foo INT, bar STRING);
Now load a local (non HDFS) file into the table.

hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Because pokes is not declared to be an external table (external to the
warehouse directory, but still on HDFS), the table will be created in the
warehouse directory. The second statement loads the file into the table. If you .

What are the problems that engineers have run into using Hive?

There are no row-level update statements in Hive QL. You basically have to overwrite all the files associated with a partition/table every time you want to do an update. While this doesn't seem so bad at first, imagine if you had to perform this "update" overwrite on terabytes of data frequently. There is hope though! An integration of Hive and HBase is in the works.

Although user variable substitution is supported, the query cannot overwrite user variables. This means you cannot do running sums or top N per group.

You can only do an equality join. This means you cannot easily do ranking.

Written 21 Sep, 2011.


The query planner is not very good - it won't reorder your joins for efficiency (small left
outer join big is usually faster than the reverse), and its estimation of the number of
mappers/reducers to allocate is sometimes off by quite a bit. It also seems to be bad at
applying "where" filters as early as possible, applying them at the end instead.

I have a table with 10 fields, including time and language. I want to display the total count for one particular language within a particular second, and nothing for the other languages. How do I write the query?
select COUNT(*)
FROM <table>
WHERE time = <seconds desired>
and language = <language>;

Default type of JOIN used by Hive?

Hive supports equi-joins by default.
You can optimize your join by using a map-side join or a merge join, depending on the size and sort order of your tables.
Check this post for more details: Hadoop's Map-side join implements Hash join?
For more details:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins

What is your strategy for loading data into Hive? Do you perform incremental loads or create a new table each day?
1. If you are performing incremental loads, how do you deal with the growing size of the data?
2. What are your data archiving strategies?

The usual way is to create new partitions for incremental loads, with the date as the partition key. This avoids creating a new table for each incremental load. With a single table partitioned appropriately, querying and reloading data is easier. Try to avoid nested sub-partitions unless required. For example, instead of creating year, month, and day as nested partitions, create YYYY-MM-DD as a single partition.

Create tables partitioned by date, and load data into the corresponding partitions every day. Archive data that is more than 180 days old.

What is the best studio software/tool to run HIVE SQL/HQL queries by a
data analyst?
Hive command line. Forces me to think through the query properly before I run it.

The best one I have used is not available to the public. It is Facebook's internal UI
for Hive called HiPal.
What sets HiPal apart from all the other tools is that it takes into account the fact that tools like Hive are batch oriented. Your query is going to run for minutes or hours. The problem with tools like Hue and others is that they don't "set your query aside" while it runs, allowing you to write and run other queries at the same time. Any tool querying Hive, Impala, etc. should let the same user write and run multiple queries concurrently in the same session.

Written 5 Nov.
How does Hive store tables and is that format accessible to Pig?
I have been reading about Pig and in the paper it is said to work directly on raw
files. How does Hive
Metadata about (Hive) tables is stored in the Hive metastore (exposed to other tools through HCatalog), which itself keeps the data in an RDBMS like MySQL. Pig (and other tools) can access these table definitions too. The data itself can be stored in HDFS or S3, for example.
Note that Hive does a best-effort schema on read, i.e. when a query is executed the data is read from the storage location and deserialised based on the metadata. That gives you the flexibility to add or remove data directly on the file system without Hive, which is useful when you land data from external systems.
Hive also has two types of table definition: the 'normal' one, which means that dropping a table removes both the metadata and the data on the filesystem, and an 'external' table definition, which means Hive only manages the metadata and dropping the table leaves the data on the filesystem untouched.

What kind of tools are you using to design your database schema in Hive?
We have tools like ER/Studio (by Embarcadero) to define relationships between tables. Is there something similar for big data shops as well?

ER/Studio works for designing Hive schemas as well. Of course, we may not be able to specify certain table specs like external vs. internal tables, location, partitions(?), etc.

Where are Hive table schemas stored in Hive?

All metadata is stored in a metastore repository, which is usually a database like MySQL or Oracle.

Written 30 Dec.

What's your biggest frustration with Hive?

Hive kicks off MapReduce jobs for tasks that really don't require a MapReduce job.

Custom UDFs cannot be registered as permanent functions. This is frustrating because end users have to register the UDF jar every time they need to use the custom functions. On top of that, there are limited built-in UDFs.

No full support for subqueries.

Hive supports creation of external tables only on level one. It won't read or allow sub-directories.

The biggest of all is convincing people of how it differs from a traditional RDBMS and why certain things can't be achieved in Hive.

The single most annoying thing that ALWAYS kills me is that when extracting data
from Hive, empty string ('') and NULL are represented differently. NULL extracts as
'\N' which ends up getting loaded into my target systems. And I always seem to
forget this so I then have to go back, make a code change, and then reload the
whole thing.
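
One way to keep '\N' out of a target system is to convert it during extraction. The sketch below is only an illustration under stated assumptions: the extract is tab-separated text, no real value is the literal two-character string \N, and the function name nulls_to_empty is hypothetical. Note that it makes NULL and the empty string indistinguishable again, which may or may not be what the target expects.

```shell
# Hypothetical filter: replace fields that are exactly \N (Hive's text
# representation of NULL) with empty strings in tab-separated input.
nulls_to_empty() {
    awk 'BEGIN { FS = OFS = "\t" }
         { for (i = 1; i <= NF; i++) if ($i == "\\N") $i = ""; print }'
}

# Example (assumes the extract was written to extract.tsv):
#   nulls_to_empty < extract.tsv > cleaned.tsv
```

Only whole fields are rewritten, so a value that merely contains \N somewhere inside it is left alone.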
We cannot delete/insert selective rows in Hive.
For example, this query is not supported:
DELETE FROM tablename WHERE column1 = 'value'

No full support for subqueries.

Hive is great for batch processing on big data, not so good at interactive query.

How do I implement a Hive table partitioned by date in Pig?


Create table x (Id int, name string) partitioned by (date string)
You can use HCatalog that provides a gateway between Hive and Pig.
In practice you can have a partitioned table created in Hive and then
accessed/modified by Pig. For example you can look at this Hadoop tutorial: how to
access Hive in Pig with HCatalog in Hue.

Pig alone doesn't understand the partitioning scheme you've set up in Hive.
But let's assume that you really want to use Pig to add data to a table to be queried by Hive. This is a pretty straightforward process.
1. Have Pig write to a directory structure with a specific naming standard (e.g. /home/user/warehouse/x/date=20130105/).
2. Then create the table: CREATE EXTERNAL TABLE x (Id int, name string) PARTITIONED BY (date string) LOCATION '/home/user/warehouse/x';
3. Then tell Hive to recover all partitions. See LanguageManual DDL for specific syntax.
4. Then query the Hive table x (e.g. select Id, name from x;)
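
The naming standard in step 1 is simply key=value under the table's root directory. As a rough sketch (the helper name partition_dir is hypothetical and not part of Pig or Hive), the target directory for one day's output could be built like this:

```shell
# Hypothetical helper: build the directory for one partition using the
# key=value naming convention shown in step 1.
partition_dir() {
    table_root=$1   # e.g. /home/user/warehouse/x
    key=$2          # partition column, e.g. date
    value=$3        # partition value, e.g. 20130105
    # Strip one trailing slash from the root so the path has no "//".
    printf '%s/%s=%s/\n' "${table_root%/}" "$key" "$value"
}

partition_dir /home/user/warehouse/x date 20130105
```

The printed path is what the Pig job would store into, one directory per partition value.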

What is the best way to back up a Hive partitioned table to disk?
I have a running table which is partitioned by date. I want to back it up as a file, so that if the HDFS blocks where the table data is stored get corrupted or go missing, I can restore it from the backup file.
It would be best if you give an indication of how big your partition is.
HDFS is considered fairly reliable from a technical failure point of view, but to
archive data you may want to copy it offline, with copyToLocal, or copy the data to
another cluster or another location on the cluster. For this, you can use DistCp.

Please see Hadoop DistCp Guide for more information.


For copyToLocal the command syntax is:
Usage: hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
Similar to get command, except that the destination is restricted to a local file
reference.
See: Hadoop Shell Commands
Additionally, you can use Hive to copy data from a table to a local directory.
Example: INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a WHERE year=2013
Note that the WHERE clause describes your table partition.

HDFS: How long does it take for the NameNode to start up?
Time to load the FS metadata from disk is affected by several factors, including
number of dfs.namenode.name.dirs configured (since it does a checkpoint in each
one on startup), size of the fsimage, size of the edits, and the speed of the disks
involved. Just anecdotally, I've seen a 500 MB fsimage file with no edits take a few
(< 10) minutes to load. I've seen a large (> 300MB) edits file take ~10 minutes to
load.
I vaguely recall hearing someone at Yahoo once say that one of their 4k clusters
takes something like 30 minutes for the NN to process the metadata on startup,
longer (0.5-2 hours) for block reports.
Also, from Matt Foley in July of 2011 [1]:
Analysis of startup logs of a production cluster with approx. 2000 nodes, with and without "-upgrade", showed four primary time sinks:

Namenode read/process/write of FSImage and Edits logs (10-15 minutes)
If upgrading, waiting for Datanode snapshots to complete (45+ minutes)
Namenode processing of Datanode registrations and Initial Block Reports (20 minutes to 2 hours)
Delay in full functionality upon leaving Safe Mode, due to high-intensity processing of over- and under-replicated blocks (several more minutes)

The first issue addressed was the Datanode startup time, primarily for snapshotting during -upgrade processing. HDFS-1445 reduced this time from about 10 minutes per volume to approx. 30 seconds per volume. Other potential improvements are noted in HDFS-1443, but they are an order of magnitude smaller.

Column Stores: What are the advantages of Trevni over RCFile format?

You may want to look at ORCFile, which is being developed as a successor to RCFile in the context of Hive. The Hive ORCFile Jira (Create a new Optimized Row Columnar file format for Hive) has detailed notes on the benefits of the new format (and some comparisons with Trevni as well), and I think those would roughly answer this question too. Once it stabilizes, ORCFile will most likely become the de facto replacement for RCFile, so the comparison between ORCFile and Trevni might be a more useful one.

What are the advantages of Hive over SQL?


I'm used to querying SQL databases and performing statistical analysis in R. What would be the main advantages of moving to Hive from SQL? Only increased speed when querying big databases? And how big is big?

The biggest advantage is developer productivity, though this can come at the
expense of execution speed (mostly latency) and efficiency (high throughput via
brute force).
First I'll point out that HiveQL is SQL, or at least a variant of SQL. And since no
database vendor follows the SQL standard perfectly, they're all variants as far as I'm
concerned. The Tableau connector for Hive supports all of the same important
functionality as any of the other connectors we offer for SQL databases.
Hive does have benefits over other SQL systems implemented in databases. Hive
has several interesting UDF packages and makes it easy to contribute new UDFs.
You can explicitly control the map and reduce transform stages, and even route
them through simple scripts written quickly in languages like Python or Perl. You can
also work with many different serialization formats, making it easy to ingest nearly
any kind of data. Finally, you can connect your Hive queries to several Hadoop
packages for statistical processing, including Apache Mahout, RHipe, RHive or
RHadoop. For these reasons Hive can improve developer productivity when working
with challenging data formats or complex analytical tasks.
When it comes to system performance, Hive has several downsides. First, the batch
nature of Map / Reduce makes Hive perform poorly when you need low-latency
execution for simple queries. I don't have concrete numbers from a real
measurements study, but my sense is that you need data which is at least 10s of
millions of records or data which occupies at least 10s of GB on a modest-sized
cluster before Hive will start outperforming classic databases like PostgreSQL. The
other downside of Hive is that the query compiler and optimizer are still very young.
Not only does the compiler frequently choose mediocre execution plans when joins
or filtering are involved, but there are a tremendous number of tuning knobs,
indexing strategies, and physical layout strategies with bucketing and partitioning,
which all have unclear interactions with each other and which make it very hard to
design your table structures for best performance with a general class of queries.

While the community is working vigorously to address these issues, they are still
missing a lot of functionality commonly present in database systems for supporting
high-performance analytical queries. Other systems like Apache Drill may side-step
some of these issues altogether. It's not clear how much you will gain from Hive
over your existing SQL system in terms of raw performance, but your productivity
may improve. In the meantime keep an eye on the rapid progress the Hadoop
community is making in improving the whol

Apache Hive: Is there a way to query only certain nodes within a Hadoop/Hive cluster?
Say I have 3 nodes: 192.168.1.100, 192.168.1.101 and 192.168.1.102.
Is there a way to pull data back from only 2 of the nodes instead of all 3?
I am not sure why you want to do this manually. But to get the same effect, I
suggest you properly partition your data. When the data is ingested HDFS will
take care of distributing it appropriately across the different nodes. When you need
to query data that is properly partitioned then Hive will take care of ensuring that
only the data nodes with the appropriate data will be queried.

How big is your Hive Meta Store? How many Tables do you have in your
Meta Store?
I don't think there is any limitation at the moment. I have seen instances with more than 300 tables, though not all of them hold huge amounts of data. One or two have more than 500 TB.

I have managed a metastore with thousands of tables, some of which have hundreds of thousands of partitions per table. At that scale the metastore (MySQL backed) is fairly large, ~20 GB, and you may start needing a decent database machine to ensure smooth operation.

Hive unable to manually set number of reducers

I have the following Hive query:
select count(distinct id) as total from mytable;
which automatically spawns 1408 mappers and 1 reducer.

I need to manually set the number of reducers and I have tried the following:
set mapred.reduce.tasks=50
set hive.exec.reducers.max=50
but none of these settings seem to be honored. The query takes forever to run. Is there a way to manually set the reducers, or maybe rewrite the query so it can result in more reducers? Thanks!

How many nodes are you using?
It doesn't matter, Tudor; even if he had only one reduce slot, he could still have more reducers. (Donald Miner, Jan 6 '12 at 17:51)
I doubt this is true since you have 1400 mappers, but are you running in local mode? If so, that'll keep your reducer at 1, I believe. (Donald Miner, Jan 6 '12 at 18:04)

3 Answers
Writing a query in Hive like this:
SELECT COUNT(DISTINCT id) ....
will always result in using only one reducer. You should:
1. use this command to set the desired number of reducers:
set mapred.reduce.tasks=50
2. rewrite the query as follows:
SELECT COUNT(*) FROM ( SELECT DISTINCT id FROM ... ) t;
This will result in 2 map+reduce jobs instead of one, but the performance gain will be substantial.
(Accepted answer by Wojtek, Jan 7 '12 at 14:58)

Thanks a lot. This worked! (magicalo, Jan 9 '12 at 15:22)
Cool. How come the Hive compiler doesn't do this optimization (turning it into 2 MR jobs) by itself automatically? (ihadanny, Apr 26 '13 at 21:22)
There are situations where turning this into 2 MR jobs isn't an optimization. For instance, if id is already close to unique and the table is stored in a columnar file format (like RCFILE), then 1 MR job would certainly be better. Since situations like that aren't outlandish, I imagine that's why no one has built this optimization into Hive. (Daniel Koverman, May 16 '13 at 19:59)

You could set the number of reducers spawned per node in the conf/mapred-site.xml config file. See here: http://hadoop.apache.org/common/docs/r0.20.0/cluster_setup.html.
In particular, you need to set this property:
mapred.tasktracker.reduce.tasks.maximum

That setting applies to all jobs. If you want to set it for a specific query, I think it is better to use set mapred.reduce.tasks. (brain storm, Aug 6 '14 at 17:32)

The number of reducers also depends on the size of the input.
By default each reducer is given 1 GB (1000000000 bytes) of input. You can change that by setting the property hive.exec.reducers.bytes.per.reducer:
1. either by changing hive-site.xml:
<property>
  <name>hive.exec.reducers.bytes.per.reducer</name>
  <value>1000000</value>
</property>
2. or by using set:
hive -e "set hive.exec.reducers.bytes.per.reducer=1000000"

How does Hive choose the number of reducers for a job?

Several places say the default # of reducers in a Hadoop job is 1. You can use the mapred.reduce.tasks symbol to manually set the number of reducers.
When I run a Hive job (on Amazon EMR, AMI 2.3.3), it has some number of reducers greater than one. Looking at job settings, something has set mapred.reduce.tasks, I presume Hive. How does it choose that number?
Note: here are some messages while running a Hive job that should be a clue:
...
Number of reduce tasks not specified. Estimated from input data size: 500
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
...

(Asked by dfrankow, Apr 24 '13 at 22:27; edited Apr 25 '13 at 5:55. Tags: hadoop, hive)

Good question. Specifically, when does Hive choose "Number of reduce tasks determined at compile time" and when does it choose "estimated from input data size"? (ihadanny, Apr 25 '13 at 14:39)
Added that in the answer below. (Joydeep Sen Sarma, Apr 26 '13 at 1:14)

1 Answer

The default of 1 may be for a vanilla Hadoop install. Hive overrides it.
In open source Hive (and likely EMR):
# reducers = (# bytes of input to mappers)
/ (hive.exec.reducers.bytes.per.reducer)

This post says the default hive.exec.reducers.bytes.per.reducer is 1G.
You can limit the number of reducers produced by this heuristic using hive.exec.reducers.max.
If you know exactly the number of reducers you want, you can set mapred.reduce.tasks, and this will override all heuristics. (By default this is set to -1, indicating Hive should use its heuristics.)
In some cases, say 'select count(1) from T', Hive will set the number of reducers to 1, irrespective of the size of the input data. These are called 'full aggregates'. If the only thing the query does is full aggregates, the compiler knows that the data from the mappers is going to be reduced to a trivial amount and there's no point running multiple reducers.
(Accepted answer.)
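
The heuristic described in this answer is easy to try out by hand. The sketch below is an approximation, not Hive's actual code: the function name estimate_reducers and its argument order are hypothetical, and rounding up with a floor of one reducer is an assumption. It does not model the 'full aggregates' case or an explicit mapred.reduce.tasks setting.

```shell
# Sketch of the heuristic described above.
#   $1 = bytes of input to the mappers
#   $2 = hive.exec.reducers.bytes.per.reducer
#   $3 = hive.exec.reducers.max
estimate_reducers() {
    input_bytes=$1
    bytes_per_reducer=$2
    max_reducers=$3
    # Integer ceiling of input_bytes / bytes_per_reducer (rounding up is an assumption).
    n=$(( (input_bytes + bytes_per_reducer - 1) / bytes_per_reducer ))
    # Use at least one reducer, and never more than the configured maximum.
    if [ "$n" -lt 1 ]; then n=1; fi
    if [ "$n" -gt "$max_reducers" ]; then n=$max_reducers; fi
    echo "$n"
}

# 500 GB of input at 1 GB per reducer, capped at 999 reducers:
estimate_reducers 500000000000 1000000000 999
```

With these inputs the estimate is 500, the same figure as in the "Estimated from input data size" message quoted in the question.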

How to stop a particular job while running Hive queries on Hadoop?

Scenario:
When I enter the query on the Hive CLI, I get the errors below.

Query:
$ bin/hive -e "insert overwrite table pokes select a.* from invites a where a.ds='2008-08-15';"

Error is like this:
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201111291547_0013, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201111291547_0013
Kill Command = C:\cygwin\home\Bhavesh.Shah\hadoop-0.20.2/bin/hadoop job -Dmapred.job.tracker=localhost:9101 -kill job_201111291547_0013
2011-12-01 14:00:52,380 Stage-1 map = 0%, reduce = 0%
2011-12-01 14:01:19,518 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201111291547_0013 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

Question:
So my question is: how do I stop a job? In this case the job is :.

You can stop a job by running hadoop job -kill <job_id>.


(Accepted answer by alexlod, Dec 4 '11 at 18:11)
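
To use hadoop job -kill from a script you first need the job ID, which Hive prints on its "Starting Job = ..." line as in the output quoted in the question. The following is a minimal sketch under that assumption; the function name extract_job_id is hypothetical, and depending on your setup the Hive progress lines may be written to stderr rather than stdout.

```shell
# Hypothetical helper: read Hive console output on stdin and print the
# first MapReduce job ID found on a "Starting Job = ..." line.
extract_job_id() {
    sed -n 's/.*Starting Job = \(job_[0-9]*_[0-9]*\).*/\1/p' | head -n 1
}

# Example (assumes the Hive output was saved to hive.log):
#   hadoop job -kill "$(extract_job_id < hive.log)"
```

If no matching line is present the function prints nothing, so a script should check for an empty result before calling the kill command.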

One more option is to try the WebHCat API from a browser or from the command line, using utilities like curl. Here's the WebHCat API to delete a Hive job.
Also note that the link says:
The job is not immediately deleted, therefore the information returned may not reflect deletion, as in our example. Use GET jobs/:jobid to monitor the job and confirm that it is eventually deleted.

How can I increase reducers in Hive?

select
emp.deptno, emp.ename, emp.empno, emp.job, emp.mgr,
emp.mgr, emp.hiredate, emp.sal, emp.comm, dept.dname,
dept.loc
from emp
join dept on emp.deptno = dept.deptno;

It is 9 GB of data. It has a problem at the reduce stage: it gets stuck at reducer 99%. I have increased the reducers to 150, but it still does not give a result.
(Asked by user2702383, Sep 11 '13 at 9:14; edited by zero323, Sep 11 '13 at 9:26. Tags: hadoop, hive)

2 Answers

You can use:
set mapred.reduce.tasks=113

And your problem could be related to data skew (meaning some keys are very dense).
(Answered by Ashish, Sep 11 '13 at 10:16)


A skewed join will send a disproportionately large number of values to only one reducer, and you'll get the long tail of a "99% job complete" syndrome, so you may be experiencing this. Looking at the job logs (IO especially) will reveal if this is the culprit.

In such cases you can use Skewed Join Optimization, which in turn relies on List Bucketing. You will have to determine which key values (deptno) are heavily skewed and declare them accordingly in the DDL:
alter table emp (schema) skewed by (deptno) on ('<skewedvalue>');

Read the linked article for details, and go over the comments and changes of HIVE-3086.