
Pig vs Hive vs Native MapReduce

Complex branching logic with a lot of nested if/else structures is easier and quicker to implement in standard MapReduce. For processing structured data you could use Pangool, which also simplifies things like JOINs. Standard MapReduce also gives you full control to minimize the number of MapReduce jobs that your data processing flow requires, which translates into performance. But it requires more time to code and to introduce changes.
Apache Pig is good for structured data too, but its advantage is the ability to work with BAGs of data (all rows that are grouped on a key). It is simpler to implement things like:

- Get the top N elements for each group (see the sketch after this list);

- Calculate a total per group and then put that total against each row in the group;

- Use Bloom filters for JOIN optimisations;

- Multi-query support (where Pig tries to minimise the number of MapReduce jobs by doing more work in a single job).
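For instance, the top-N-per-group pattern is a short nested FOREACH over the grouped BAG. Here is a minimal sketch using Pig's embedded Java API (PigServer); the input path, schema, and N=3 are assumptions for illustration:

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class TopNPerGroup {
        public static void main(String[] args) throws Exception {
            PigServer pig = new PigServer(ExecType.MAPREDUCE);
            // Hypothetical input: tab-separated (user, score) rows in HDFS.
            pig.registerQuery("logs = LOAD '/data/logs' AS (user:chararray, score:int);");
            pig.registerQuery("grouped = GROUP logs BY user;");
            // The nested FOREACH operates on the whole BAG of rows per key.
            pig.registerQuery("top3 = FOREACH grouped {"
                    + " sorted = ORDER logs BY score DESC;"
                    + " lim = LIMIT sorted 3;"
                    + " GENERATE group, FLATTEN(lim); };");
            pig.store("top3", "/data/top3_per_user");  // triggers the MapReduce job(s)
        }
    }

The same logic written directly in MapReduce would typically need a secondary sort or an in-reducer heap per key.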

Hive is better suited for ad-hoc queries, and its main advantage is that it has an engine that stores and partitions data. Its tables can also be read from Pig or standard MapReduce.
One more thing: neither Hive nor Pig is well suited to working with hierarchical data.


Short answer - we need MapReduce when we need very deep and fine-grained control over the way we want to process our data. Sometimes it is not very convenient to express exactly what we need in terms of Pig and Hive queries.
It should not be totally impossible to do through Pig or Hive what you can do using MapReduce. With the level of flexibility provided by Pig and Hive you can somehow manage to achieve your goal, for example by writing UDFs, but it might not be that smooth.
There is no clear distinction as such among the usage of these tools. It totally depends on your particular use case. Based on your data and the kind of processing, you need to decide which tool fits your requirements better.
Edit:
Some time ago I had a use case wherein I had to collect seismic data and run some analytics on it. The format of the files holding this data was somewhat weird: part of the data was EBCDIC encoded, while the rest was in binary format. It was basically a flat binary file with no delimiters like \n. I had a tough time finding a way to process these files using Pig or Hive. As a result I had to settle on MR. Initially it took time, but gradually it became smoother, as MR is really swift once you have the basic template ready (see the sketch below).
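A minimal sketch of such a reusable template (new-API Hadoop MapReduce); the job name, record parsing, and the word-count-style aggregation are placeholders to adapt per use case:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BasicJobTemplate {
        public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            @Override
            protected void map(LongWritable offset, Text record, Context ctx)
                    throws IOException, InterruptedException {
                // Full control here: decode EBCDIC, fixed-width fields, etc.
                ctx.write(new Text(record.toString()), new IntWritable(1));
            }
        }

        public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();  // aggregate per key
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "basic-template");
            job.setJarByClass(BasicJobTemplate.class);
            job.setMapperClass(MyMapper.class);
            job.setReducerClass(MyReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

For truly delimiter-free binary input like the seismic files above, you would also plug in a custom InputFormat/RecordReader; the rest of the template stays the same.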
So, like I said earlier, it basically depends on your use case. For example, iterating over each record of your dataset is really easy in Pig (just a FOREACH), but what if you need to act on every nth record? So, when you need "that" level of control over the way you process your data, MR is more suitable.
Another situation might be when your data is hierarchical rather than row-based, or if your data is highly unstructured.
Metapattern problems involving job chaining and job merging are easier to solve using MR directly rather than using Pig/Hive.
And sometimes it is very convenient to accomplish a particular task using some other tool as compared to doing it with Pig/Hive. IMHO, MR turns out to be better in such situations as well. For example, if you need to do some statistical analysis on your BigData, R used with Hadoop Streaming is probably the best option to go with.

Pig: a dataflow language and environment for exploring very large datasets.
Hive: a distributed data warehouse


Hive is not a full database. The design constraints and limitations of Hadoop and HDFS impose limits on what Hive can do.
Hive is most suited for data warehouse applications, where:
1) relatively static data is analyzed,
2) fast response times are not required, and
3) the data is not changing rapidly.
Hive doesn't provide crucial features required for OLTP (Online Transaction Processing). It's closer to being an OLAP (Online Analytical Processing) tool. So, Hive is best suited for data warehouse applications, where a large data set is maintained and mined for insights, reports, etc.

In simpler words, Pig is a high-level platform for creating MapReduce programs used with Hadoop; using Pig scripts we process large amounts of data into the desired format.
Once the processed data is obtained, it is kept in HDFS for later processing to obtain the desired results.
On top of the stored processed data we apply Hive SQL commands to get the desired results; internally these Hive SQL commands run MapReduce programs.
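As a minimal sketch of that last step, one might map the (hypothetical) Pig output directory in HDFS to an external Hive table and query it over HiveServer2's JDBC interface; the host, port, schema, and paths are assumptions:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class QueryProcessedData {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "hive", "");
            try (Statement stmt = conn.createStatement()) {
                // Expose the Pig output as a table; Hive only stores metadata here.
                stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS processed (user_id STRING, total INT) "
                        + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' LOCATION '/data/processed'");
                // This HiveQL query is compiled into MapReduce job(s) under the hood.
                ResultSet rs = stmt.executeQuery(
                        "SELECT user_id, SUM(total) FROM processed GROUP BY user_id");
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
            conn.close();
        }
    }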
Pig allows one to load data and user code at any point in the pipeline. This can be particularly important if the data is streaming data, for example data from satellites or instruments.
Hive, which is RDBMS based, needs the data to be imported (or loaded) first, and only after that can it be worked upon. So if you were using Hive on streaming data, you would have to keep filling buckets (or files) and use Hive on each filled bucket, while using other buckets to keep storing the newly arriving data.
Pig also uses lazy evaluation. It allows greater ease of programming, and one can use it to analyze data in different ways with more freedom than in an SQL-like language such as Hive. So if you really wanted to analyze matrices or patterns in some unstructured data you had, and wanted to do interesting calculations on them, with Pig you can go some fair distance, while with Hive you would need something else to play with the results.
Pig is faster in data import but slower in actual execution than an RDBMS-friendly language like Hive.
Pig is well suited to parallelization, and so it possibly has an edge for systems where the datasets are huge, i.e. systems where you are concerned more about the throughput of your results than the latency (the time to get any particular datum of the result).
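To make the lazy evaluation point concrete, here is a sketch using Pig's embedded Java API (alias names and paths are hypothetical): nothing executes until a STORE (or DUMP) forces the accumulated plan to run.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class LazyPipeline {
        public static void main(String[] args) throws Exception {
            PigServer pig = new PigServer(ExecType.MAPREDUCE);
            // These calls only build Pig's logical plan; no job is launched yet.
            pig.registerQuery("raw = LOAD '/data/stream' AS (line:chararray);");
            pig.registerQuery("hits = FILTER raw BY line MATCHES '.*pattern.*';");
            // Only now is the whole plan optimised and run as MapReduce job(s).
            pig.store("hits", "/data/hits");
        }
    }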
MapReduce
Strength: works on both structured and unstructured data; good for writing complex business logic.
Weakness: long development time; hard to achieve join functionality.

Hive
Strength: less development time; suitable for ad-hoc analysis; easy to do joins.
Weakness: not easy for complex business logic; deals only with structured data.

Pig
Strength: works on structured and unstructured data; joins are easily written.
Weakness: a new language to learn; gets converted into MapReduce.

By Hadoop, I guess you are referring to MapReduce? Hadoop as such is an ecosystem which consists of many components (including MapReduce, HDFS, Pig and Hive).
MapReduce is good when you need to write the logic for processing data at the Map() and Reduce() method level. In my work, I find MapReduce very useful when I'm dealing with data that is unstructured and needs to be cleansed.
Hive, Pig: they are good for batch processes, running periodically (maybe in terms of hours or days).
HBase & Cassandra: they support low-latency calls, so they can be used for real-time applications where response time is key. Have a look at this discussion to get a better idea about HBase vs Cassandra.

When to use HBase and when to use Hive

MapReduce is just a computing framework. HBase has nothing to do with it. That said, you can efficiently put or fetch data to/from HBase by writing MapReduce jobs. Alternatively you can write sequential programs using other HBase APIs, such as the Java client API, to put or fetch the data. But we use Hadoop, HBase etc. to deal with gigantic amounts of data, so that doesn't make much sense: using normal sequential programs would be highly inefficient when your data is too huge.
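For the sequential (non-MapReduce) route, the HBase Java client API looks roughly like this; the table name, column family, and values are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBasePutGet {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("events"))) {
                // Random write: a key/value pair into column family 'cf'.
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"), Bytes.toBytes("42"));
                table.put(put);
                // Random read of the same row.
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("value"))));
            }
        }
    }

This is fine for targeted lookups; for bulk loads and scans over huge data, the MapReduce integration mentioned above is the efficient path.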
Coming back to the first part of your question: Hadoop is basically two things - a distributed file system (HDFS) + a computation or processing framework (MapReduce). Like all other file systems, HDFS also provides us storage, but in a fault-tolerant manner with high throughput and a lower risk of data loss (because of replication). But, being a file system, HDFS lacks random read and write access. This is where HBase comes into the picture. It's a distributed, scalable, big data store, modelled after Google's BigTable. It stores data as key/value pairs.
Coming to Hive: it provides us data warehousing facilities on top of an existing Hadoop cluster. Along with that it provides an SQL-like interface which makes your work easier, in case you are coming from an SQL background. You can create tables in Hive and store data there. Along with that you can even map your existing HBase tables to Hive and operate on them (see the sketch below).
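The HBase-to-Hive mapping is done with Hive's HBase storage handler. A minimal sketch issued over JDBC; the table names and column mapping are assumptions:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class MapHBaseToHive {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "hive", "");
            try (Statement stmt = conn.createStatement()) {
                // Expose the existing (hypothetical) HBase table 'events' to HiveQL.
                stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS hbase_events (key STRING, value STRING) "
                        + "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' "
                        + "WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:value') "
                        + "TBLPROPERTIES ('hbase.table.name' = 'events')");
            }
            conn.close();
        }
    }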
Pig, for its part, is basically a dataflow language that allows us to process enormous amounts of data very easily and quickly. Pig basically has two parts: the Pig interpreter and the language, Pig Latin. You write Pig scripts in Pig Latin and process them using the Pig interpreter. Pig makes our life a lot easier; otherwise writing MapReduce is not always easy. In fact, in some cases it can really become a pain.
** Both Hive and Pig queries get converted into MapReduce jobs under the hood.

http://smartdatacollective.com/mtariq/120791/hadoop-toolbox-when-use-what
1- Hadoop: Hadoop is basically two things, a distributed file system (HDFS) which constitutes Hadoop's storage layer and a distributed computation framework (MapReduce) which constitutes the processing layer. You should go for Hadoop if your data is very huge and you have offline, batch-processing kinds of needs. Hadoop is not suitable for real-time stuff. You set up Hadoop on a group of commodity machines connected together over a network (called a cluster). You then store huge amounts of data (in GB or more) in HDFS and process it by writing MapReduce programs (or jobs). Being distributed, HDFS is spread across all the machines in the cluster, and MapReduce processes this scattered data locally by going to each machine, so that you don't have to relocate this huge amount of data.
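As a tiny illustration of the storage side, writing a file into HDFS with the FileSystem API (a sketch; the path and contents are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWrite {
        public static void main(String[] args) throws Exception {
            // Picks up fs.defaultFS from the cluster configuration on the classpath.
            FileSystem fs = FileSystem.get(new Configuration());
            try (FSDataOutputStream out = fs.create(new Path("/data/raw/sample.txt"))) {
                out.writeBytes("hello hdfs\n");  // replicated across DataNodes transparently
            }
        }
    }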

2- Hive: Originally developed by Facebook, Hive is basically a data warehouse. It sits on top of your Hadoop cluster and provides you an SQL-like interface to the data stored in your Hadoop cluster. You can then write SQL-ish queries using Hive's query language, called HiveQL, and perform operations like store, select, join, and much more. It makes processing a lot easier, as you don't have to do lengthy, tedious coding. Write simple Hive queries and get the results. Simply map HDFS files to Hive tables and start querying the data. Not only this, you could map HBase tables as well, and operate on that data.
Tip: Use Hive when you have warehousing needs and you are good at SQL and don't want to write MapReduce jobs. One important point though: Hive queries get converted into a corresponding MapReduce job under the hood, which runs on your cluster and gives you the result. Hive does the trick for you. But not each and every problem can be solved using HiveQL. Sometimes, if you need really fine-grained and complex processing, you might have to take shelter in MapReduce.
3- HBase: HBase is a distributed, scalable, big data store, modelled after Google's BigTable. It stores data as key/value pairs. It's basically a database, a NoSQL database, and like any other database its biggest advantage is that it provides you random read/write capabilities. As I have mentioned earlier, Hadoop is not very good for your real-time needs, so you can use HBase to serve that purpose. If you have some data which you want to access in real time, you could store it in HBase. HBase has got its own set of very good APIs which can be used to push/pull the data. Not only this, HBase can be seamlessly integrated with MapReduce so that you can do bulk operations, like indexing, analytics, etc.
Tip: You could use Hadoop as the repository for your static data and HBase as the datastore which will hold data that is probably going to change over time after some processing.
4- Pig: Pig is a dataflow language that allows you to process enormous amounts of data very easily and quickly by repeatedly transforming it in steps. It basically has two parts: the Pig interpreter and the language, Pig Latin. Pig was originally developed at Yahoo, and they use it extensively. Like Hive, Pig Latin queries also get converted into a MapReduce job and give you the result. You can use Pig for data stored both in HDFS and HBase very conveniently. Just like Hive, Pig is also really efficient at what it is meant to do. It saves a lot of your effort and time by allowing you to not write MapReduce programs and to do the operation through straightforward Pig queries.
Tip: Use Pig when you want to do a lot of transformations on your data and don't want to take the pain of writing MapReduce jobs.
5- Sqoop: Sqoop is a tool that allows you to transfer data between relational databases and Hadoop. It supports incremental loads of a single table or a free-form SQL query, as well as saved jobs which can be run multiple times to import updates made to a database since the last import. Not only this, imports can also be used to populate tables in Hive or HBase. Along with this, Sqoop also allows you to export the data back into the relational database from the cluster.
Tip: Use Sqoop when you have lots of legacy data and you want it to be stored and processed over your Hadoop cluster, or when you want to incrementally add data to your existing storage.
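Sqoop is normally driven from the command line, but the same invocation can be made from Java via its tool runner. A sketch of an incremental import; the JDBC URL, table, and check column are assumptions:

    import org.apache.sqoop.Sqoop;

    public class IncrementalImport {
        public static void main(String[] args) {
            // Equivalent to running `sqoop import ...` on the command line.
            String[] sqoopArgs = {
                "import",
                "--connect", "jdbc:mysql://dbhost/sales",  // hypothetical source database
                "--table", "orders",
                "--target-dir", "/data/orders",
                "--incremental", "append",                 // only rows newer than the last import
                "--check-column", "id"
            };
            System.exit(Sqoop.runTool(sqoopArgs));
        }
    }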
6- Oozie: Now you have everything in place and want to do the processing, but find it crazy to start the jobs and manage the workflow manually all the time, especially in the cases where multiple MapReduce jobs need to be chained together to achieve a goal. You would like to have some way to automate all this. No worries, Oozie comes to the rescue. It is a scalable, reliable and extensible workflow scheduler system. You just define your workflows (which are Directed Acyclic Graphs) once and the rest is taken care of by Oozie. You can schedule MapReduce jobs, Pig jobs, Hive jobs, Sqoop imports and even your Java programs using Oozie.
Tip: Use Oozie when you have a lot of jobs to run and want some efficient way to automate everything based on some time (frequency) and data availability.
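Workflows themselves are defined in a workflow.xml stored in HDFS; submitting one from Java with the Oozie client library looks roughly like this (the server URL, paths, and properties are assumptions):

    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;

    public class SubmitWorkflow {
        public static void main(String[] args) throws Exception {
            OozieClient oozie = new OozieClient("http://localhost:11000/oozie");
            Properties conf = oozie.createConfiguration();
            // Points at the workflow.xml (the DAG definition) already in HDFS.
            conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/apps/my-workflow");
            conf.setProperty("nameNode", "hdfs://namenode:8020");
            conf.setProperty("jobTracker", "namenode:8032");
            String jobId = oozie.run(conf);  // submits and starts the workflow
            System.out.println("Workflow job submitted: " + jobId);
        }
    }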
7- Flume/Chukwa: Both Flume and Chukwa are data aggregation tools that allow you to aggregate data in an efficient, reliable and distributed manner. You can pick data up from some place and dump it into your cluster. Since you are handling BigData, it makes more sense to do it in a distributed and parallel fashion, which both these tools are very good at. You just have to define your flows and feed them to these tools, and the rest will be done automatically by them.
Tip: Go for Flume/Chukwa when you have to aggregate huge amounts of data into your Hadoop environment in a distributed and parallel manner.
