

[Hive-User] Specification of SERDE in RCFile









Specification Of SERDE In RCFile

Discussion Overview

Tagged samples





Group Hive-user

asked Jan 22 2013 at 07:06
active Jan 22 2013 at 07:06

Hi folks,

In samples here and there, I've seen table definitions using RCFile storage that specify a SERDE in some cases, and sometimes not.

i.e., sometimes:

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe' STORED AS RCFILE

and at other times the SERDE clause is left out.

My question: what happens if the SERDE is not specified? What is the default behavior?

Thank you

Mathieu Despriee
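
One way to see what Hive actually does in each case is to create both variants and inspect the resulting table metadata. The following is only a sketch: the table names and columns are hypothetical, and the SerDe reported for the second table may depend on the Hive version and configuration in use.

```sql
-- Hypothetical table names and columns, for illustration only.

-- Variant 1: SerDe named explicitly, as in the first form quoted in the question.
CREATE TABLE rc_explicit_serde (id INT, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS RCFILE;

-- Variant 2: no SERDE clause; Hive chooses a SerDe for RCFile storage itself.
CREATE TABLE rc_default_serde (id INT, name STRING)
STORED AS RCFILE;

-- Compare the "SerDe Library" row in the output for each table.
DESCRIBE FORMATTED rc_explicit_serde;
DESCRIBE FORMATTED rc_default_serde;
```

If the two tables report the same SerDe library, the explicit clause is redundant on that installation; if they differ, the second output shows the default being applied.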


Related discussions

Size Of RCFile In Hive

Hi, I tried to convert and merge many small text files into RCFiles using Hive SQL, but Hive produced some small RCFiles.
set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
set io.compression.codecs=com.hadoop.compression.lzo.LzoCodec;
hive.merge.mapfiles=true hive.merge.mapredfiles=true


Row Group Size Of RCFile



How do I set the Row Group Size of RCFile in Hive? CREATE TABLE OrderFactPartClustRcFile( order_id INT, emp_id INT,
order_amt FLOAT, order_cost FLOAT, qty_sold FLOAT, freight FLOAT, gross_dollar_sales FLOAT, ship_date STRING,
rush_order STRING, customer_id INT, pymt_type INT, shipper_id INT ) PARTITIONED BY (order_date STRING) CLUSTERED BY
(order_id) SORTED BY (order_id)



RCFile In Java MapReduce


Hi, Can someone show me how to use RCfile in plain MapReduce job (as Input and Output Format)? Please.

Read Snappy RCFile In Java

Hi all, I am trying to write java code in Eclipse to read RCFile data which is compressed by snappy codec. The code is like this:
import java.io.IOException; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.FileSystem; import
org.apache.hadoop.fs.Path; import org.apache.hadoop.hive.ql.io.RCFile; import
org.apache.hadoop.hive.serde2.ColumnProjectionUtils; public class ReadRCfile

Beeswax And RCFile

Hi, why cant I choose RCFile as dataFormat in Beeswax?

RCFile Performance




Hi Experts, I have a large file with 300+ columns. In order to query only few rows efficiently, I am using RCFile format in Hive. I
have tried setting the RCFile rowgroup size from default size till 32 MB. ex: set hive.io.rcfile.record.buffer.size = 134217728;
However, I do not see major changes in the amount of HDFS data scanned. Moreover, the amount of data scanned with RCFile is
not significantly

ArrayIndexOutOfBoundsException While Writing MapReduce Output As RCFile

That is exactly the fix. Thanks Yin.

Writing To Rcfile
Could someone please point me to someway where I can store in rcfile format with snappy compression? I need to use this
output in hive.

Hi, I want to use RCfile to address the IO problem, and I can not find some paper about how to install or how to use it by PIG, so if
you had some install or configue file, you could share with me. Thank you. Best Regards Malone 2012-05-24

"Exploding" A Hive Array In Pig From An RCFile

Hi, I'm storing data into a partitioned table using Hive in RCFile format, but I want to use Pig to do the aggregation of that data. In
my array in Hive, I have colon delimited data, E.g. :0:12:21:99: With the lateral view and explode functions in Hive, I can output
each value as a separate row. In Pig, I think I need to use flatten, but it just outputs the array as a single field, and I

Why Hive SequenceFile/RCFile Are Slow In Impala?

I'm trying to find out which is the most efficient format for my project in Impala. SEQT: Sequence files output by a mapreduce,
with a dummy key 'K', and a tab delimited string value. Both key and value are org.apache.hadoop.io.Text. SEQH: Sequence files
output by a Hive select query against the table using SEQT RC: RCFile format output by a hive select query against the table
using SEQT. PAR:

RCfile Is Not Working With BZip2. Interesting In Using LZO In General.

I'm wondering if my configuration/stack is wrong, or if I'm trying to do something that is not supported in Hive. My goal is to
choose a compression scheme for Hadoop/Hive and while comparing configurations, I'm finding that I can't get BZip2 or Gzip to
work with the RCfile format. Is that supported, i.e. using BZip2 or Gzip with RCfile? LZO appears to be fastest solution, at the
price of not compressing

Impalad Crashed When Query Data Stored In The RCFILE Format Stored In Hive Table
Hi, I have installed impala 0.6 and CDH 4.2, i have setuped my cluster with three data nodes and a namenode. First, i created a
table data stored as TEXTFILE format in hive, And i have loaded about 150 millons rows into the table, I could query data in hive
and in impalad-shell without any errors, But it was too slow query speed(described on

Strange Display When Upload RCFile From Local With Different Column.
hello, I tried to import data from SQL to Hive RCFile. I use RCFile.Writer to generate rcfile and upload to HIVE table directory. the
rcfiles has different columns: c0_1.rc has 2 columns c1_1.rc has 3 columns c2_1.rc has 4 columns the outputs was strange: #
All rc files are loaded: hive> select * from simple; OK 1 foo NULL null 2 bar NULL null 3 foobar NULL null 3 haliluya NULL

RCFile And UDF

I am new to Hive. Currently I am trying out one of the use cases where we write xml files into a sequence file. We then read the
sequence file and convert it into more structured row, col format using pig udf. This is currently being stored as snappy
compression. Now what I want to do is use hive to query data and do self join. But my problem is that file that I need to query on
is in snappy format

RCFile And Hadoop Counters

Hi, I have a question related to the hadoop counters when RCFile is used. I have 16TB of (uncompressed) data stored in
compressed RCFile format. The size of the compressed RCFile is approximately 3 TB. I ran a simple scan query on this table.
Each split is 256 MB (HDFS block size). From the counters of each individual map task I can see the following info:
HDFS_BYTES_READ : 91,235,561 Map input

RCFile And LazyBinarySerDe

Hi all, I'm having a problem, where I'm trying to insert into a table which has ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe', and is STORED AS RCFILE. The exception:
java.lang.UnsupportedOperationException: Currently the writer can only accept BytesRefArrayWritable at
org.apache.hadoop.hive.ql.io.RCFile$Writer.append(RCFile.java:863) at org.apache.hadoop

RCFile With Hive.optimize.cp Setting

In our environment, our data are put on Amazon S3, and data are in RCFile format. In order to make Hive queries work, we found
that we have to change the hive.optimize.cp to false. Otherwise, some queries will fail. Now, when we try some complicated
queries with multiple subqueries and joins, we see queries failed again. But if we run the same query with data Diagnostic
Messages for this





RCFILE And "\n" Characters

The example below shows that the RCFILE SerDe doesn't handle "\n" in string fields correctly. It seem that the SerDe uses "\n"
internally as a record delimiter but it's failing to de/serialize it correctly when it appears within a field. Is that correct? Any ideas
on how to work around that? Thanks, Andre $ echo X > dual.data $ hive hive> CREATE TABLE araujo_sandbox.dual(dummy

Vectorizied Execution On RCFile

Hi All, Vectorization with ORCFile provides amazing performance. Does vectorization work with RCFile as well? As per explain
plan of Hive 0.13 (snapshot), it does not use vectorization with RCFile. Any pointers would be appreciated. ~Rajesh.B

RCfile Format Bug?

There was an error I was facing where Impala can't seem to detect rows of a table whose data is stored in the RCfile format. But
the new release impala 1.1-1.p0.8 seemed to have fixed it. However, now certain queries that did work don't anymore. For
example, tpch query #1 (also stored in rcfile format): select L_RETURNFLAG,

Parquet VS RCfile
Hi all, I'd like to share my simple performance test for comparing 3 different file types(Text, Parquet, RCFile). Environment *
My cluster consists of 8 DNs, and each node is equipped with 24-core CPU, 64 GB memory and 6 disks. Total file size of each file
type TEXT(no compression) PARQUET(snappy) RCFILE(snappy) Total size 58.5Gb 19.2Gb 16.5Gb Num. of files 8 88 236 Num.
of rows 400M 400M

RCFile Test Result Is Strange

hi. I tested impala with rcfile and result is strange. I sent the query after flush the os cache ( run "sync ; echo 3 >
/proc/sys/vm/drop_caches" command in all datanode) and I did same job 6 times to get average. It's my test result. query select col_2 from rcfile_test where col_1 in ('2002-01-01T00:00:00' ) [0] - 30.61s [1] - 32.92s [2] - 32.97s [3] - 36.30s [4] - 30.47s
[5] - 37.71s query

[jira] [Created] (HBASE-7364) Concurrency Issue: RCFile Returns Decompressors Twice

Mikhail Bautin created HBASE-7364: ------------------------------------- Summary: Concurrency issue: RCFile returns decompressors
twice Key: HBASE-7364 URL: https://issues.apache.org/jira/browse/HBASE-7364 Project: HBase Issue Type: Bug Reporter:
Mikhail Bautin Priority: Critical -- This message is automatically

Composite Query On Hbase And Rcfile

Hi, Does anyone know if hive can run a composite query over RCFILE and HBASE in the same query? Quick answer will be highly
appreciated Thanks in advance. Rob

Problem With Load Data From Local File System Into RCFILE Table
Hi I have problem with loading data into RCFILE table from local file system. I am using hive 0.7.1 of cloudera's distribution.
1.create table create table test(c1 int,c2 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' Stored as RCFILE; 2.load
text file into table LOAD DATA LOCAL INPATH 'test.txt' INTO TABLE text; Hive command line throwing errors as the following:
Loading data to table

An Error Occurred While Using RCFile On S3

Hi all, I am testing RCFile on S3. I could execute queries which don't specify columns such as "select * from table". But, I could
not execute queries which specify columns such as "select id from table". This job progress to near the end of a map task, but
cannot finish the task as the below log message. 2011-03-22 17:12:04,325 INFO
org.apache.hadoop.fs.s3native.NativeS3FileSystem: Opening key

Complex Types, Lateral View And RCFile

Hi, I have data with complex types (map, struct, array of maps) stored as a text file. I am able to successfully create an external
table based on this data and further build a lateral view on it: hive -e 'select rownum, bag_item from complex_text LATERAL
VIEW explode(bagofmap) explodedTable AS bag_item ;' 1 {"k1":"v1","k2":"v2"} 1 {"k3":"v3","k4":"v4","k5":"v5","k6":"v6"}

Hive Insert Into RCFILE Issue With Timestamp Columns

Hi All, I am using the schema in the Impala VM and trying to create a dynamic partitioned table on date_dim. New table is called
date_dim_i and schema for that is defined as: create table date_dim_i ( d_date_sk int, d_date_id string, d_date timestamp,
d_month_seq int, d_week_seq int, d_quarter_seq

Insert Into ORC Partition From RCFile Partition

Hi, We're trying to convert our fact tables partitioned by date from RCFile to ORCFile. Since they are really big in size and we
retain the last N days (partitions) of data, we don't want to re-process existing partitions. There are two approaches using Hive
ALTER and INSERT commands that I'm comparing. Approach A: 1. Set fileformat of existing table to ORC (from RCFile) 2. Add

Hive Issues RCFile - Binary Fields Corruption Or Field Storage Issues?

I am putting binary data into binary columns in hive and using RCFile. Most data is just fine in my very large table, however
queries over certain time frames get me RCFile/Compression issues. The data goes in fine. Is this a FS level corruption issue? Is
this something tunable? How would I even go about troubleshooting something like this? Hive Runtime Error while processing
writable SEQ -org.

ArrayIndexOutOfBoundsException While Writing MapReduce Output As RCFile

Hi All, I have a scenario where I've to read an RCFile, process it and write the output as an RCFile using a MapReduce program.
My Hadoop version is CDH 4.2.1 Mapper Map Input = LongWritable, BytesRefArrayWritable Map Output = Text,
BytesRefArrayWritable (Record) *******************************CODE BEGINS******************************* //Mapper public static
class AbcMapper

MIN/MAX Issue With Timestamps And RCFILE/ORC Tables

Hi, Because of the known, and believed fixed, issue with MIN/MAX (HIVE-4931), we're using a recent (2013-12-02), locally built
version of Hive 0.13.0-SNAPSHOT. Unfortunately, we're still seeing issues using MIN/MAX on timestamp types when using
RCFILE and ORC formatted tables. I could not find a reference to this problem in the Hive JIRA, but I'm posting here first before
opening a new report

RCFile Vs SequenceFile Vs Text Files

Dear all, We are trying to pick the right data storage format for the Hive table with the following requirement and would really
appreciate any insights you can provide to help our decision. 1. ~50Billion records per month. ~14 columns per record and each
record is ~100 bytes. Table is partitioned by the date. Table gets populated periodically from another Hive query. 2. The columns

Getting Data From Hive Table Stored As Gzip Compressed RCFile

Hi folks, Does beeswax support the compressed Rcfile for hive? I have many tables stored as gzipped Rcfile. Getting data with
CLI or JDBC works fine but beeswax does not. I got an exception like below from select statement: java.io.IOException:
java.io.EOFException Please let me know how can I solve this problem. Using Hive 0.7(trunk), Hue 1.0.1 and CDH3b2. Youngwoo

Querying To Lzo-compressed RCFile Table, "Unknown Codec" Error Occurred.

Hi Impala, CDH and Cloudera Manager users, I'm new to Impala. But I'm trying to measure its querying response time.
Today,during such process,I got following error message from impala-shell. $ impala-shell -i -f Connected to :21000 Query:
select count(distinct column1) from TABLE_A Query aborted, unable to fetch data Backend 6:Unknown Codec:
com.hadoop.compression.lzo.LzopCodec Backend

MapReduce To Process GZip Compressed RCFile

Hi All, I want to access files in HDFS which are GZip compressed RCFiles (stored using HIVE) using a MapReduce program. I
want to process these files and write them back in the same format - (RCFile + GZip) using the same MapReduce program. Can
you please share your experience / thoughts / MapReduce java snippets / links or any pointers pertaining to these ? Thanks in
Advance! Regards, Cyberlord

ArrayIndexOutOfBoundsException While Writing MapReduce Output As RCFile

Hi All, I have a scenario where I've to read an RCFile, process it and write the output as an RCFile using a MapReduce program.
My Hadoop version is CDH 4.2.1 Mapper Map Input = LongWritable, BytesRefArrayWritable Map Output = Text,
BytesRefArrayWritable (Record) *******************************CODE BEGINS******************************* //Mapper public static
class AbcMapper extends

RCFILE Vs Sequence File With Snappy Codec.

Hi Folks, I was using Text files with Snappy compression to create some temporary tables in hive. Got to know its not splittable
which caused Memory to run out for some map jobs. "One thing to note is that Snappy is intended to be used with a container
format, like Sequence Files or Avro Data Files, rather than being used directly on plain text, for example, since the latter is not
splittable" from

HIVE-4014, Hive+RCFile Is Not Doing Column Pruning, CDH Specific

There seems to be a CDH specific bug (it doesn't occur with anything else) that is preventing RCFile from getting columns
pruned properly. This is rather annoying, does anyone know of a fix? https://issues.apache.org/jira/browse/HIVE-4014 Thanks, Marcin

Problem With Memory Limit When Query To Uncompress Rcfile Table Using Impala
Hi, I have a trouble with RCFile when query on it using IMPALA: 1. I using HiBench Tool to create 18Gb uncompress sequence
file and insert into uservisits. 2. Using Hive to create table with format: hive> set mapred.output.compress=false; hive> set
hive.exec.compress.output=false; hive> CREATE TABLE uservisits_rcfile (sourceIP STRING,destURL STRING,visitDate

How To Update On A Huge RCFILE FORMAT PARTITIONED HIVE Table, How To Apply
Deltas(incremental Data).
If you have a Hive table that is RCFILE FORMAT and is partitioned and want to apply updates to it from the deltas that are coming
in how can that be possible please share the ideas which will have the better performance since the whole table is pretty huge
like 10 TB size. having temp tablewhich may be built and swapped may not be good option because of huge volume, dont want to
use the HBASE too


