Hive - Data Warehousing & Analytics on Hadoop
Namit Jain, Zheng Shao Facebook
Wednesday, June 10, 2009 Santa Clara Marriott
Agenda
  Introduction
  Facebook Usage
  Hive Progress and Roadmap
  Open Source Community
Facebook
Introduction
Why Another Data Warehousing System?
Data, data and more data

  ~1TB per day in March 2008
  ~10TB per day today

Let's try Hadoop

Pros
  Superior in availability/scalability/manageability
  Efficiency not that great, but throw more hardware
  Partial availability/resilience/scale more important than ACID
Cons: Programmability and Metadata
  Map-reduce hard to program (users know sql/bash/python)
  Need to publish data in well known schemas
Solution: HIVE
Let's try Hadoop (continued)

RDBMS> select key, count(1) from kv1 where key > 100 group by key;
vs.
$ cat > /tmp/reducer.sh
uniq -c | awk '{print $2"\t"$1}'
$ cat > /tmp/map.sh
awk -F '\001' '{if($1 > 100) print $1}'
$ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
$ bin/hadoop dfs -cat /tmp/largekey/part*
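For readers who find the awk pipeline hard to follow, here is a minimal Python sketch of what the two streaming scripts compute. The function names and the sample rows are hypothetical, integer keys are assumed, and the in-memory sort only stands in for Hadoop's shuffle and sort.

```python
from itertools import groupby

def map_lines(lines, sep="\x01", threshold=100):
    """Mirror of map.sh: emit the first column when it is greater than threshold."""
    for line in lines:
        key = line.rstrip("\n").split(sep)[0]
        if int(key) > threshold:  # assumes the key column holds integers
            yield key

def reduce_keys(sorted_keys):
    """Mirror of reducer.sh (uniq -c): count each run of identical keys."""
    for key, run in groupby(sorted_keys):
        yield key, sum(1 for _ in run)

# Hypothetical rows in a ^A-delimited layout (key, value).
rows = ["238\x01val_238", "86\x01val_86", "311\x01val_311", "238\x01val_238"]
shuffled = sorted(map_lines(rows))  # stands in for the shuffle/sort between map and reduce
print(list(reduce_keys(shuffled)))
```

The point of the comparison on the slide still holds: even this short version is more machinery than the one-line SQL query.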
What is HIVE?
A system for managing and querying structured data built on top of Hadoop
  Map-Reduce for execution
  HDFS for storage
  Metadata on raw files
Key Building Principles:
  SQL as a familiar data warehousing tool
  Extensibility: Types, Functions, Formats, Scripts
  Scalability and Performance

Simplifying Hadoop
RDBMS> select key, count(1) from kv1 where key > 100 group by key;
vs.
hive> select key, count(1) from kv1 where key > 100 group by key;
Facebook Usage
Data Warehousing at Facebook Today
[Architecture diagram. Components shown: Web Servers, Scribe Servers, Filers, Hive on Hadoop Cluster, Oracle RAC, Federated MySQL]
Hive/Hadoop Usage @ Facebook

Types of Applications:
  Reporting
    E.g.: daily/weekly aggregations of impression/click counts
    SELECT pageid, count(1) AS imps FROM imp_table WHERE date = '2009-05-01' GROUP BY pageid;
    Complex measures of user engagement
  Ad hoc Analysis
    E.g.: how many group admins, broken down by state/country
  Data Mining (assembling training data)
    E.g.: user engagement as a function of user attributes
  Spam Detection
    Anomalous patterns for Site Integrity
    Application API usage patterns
  Ad Optimization
Hadoop Usage @ Facebook

Cluster Capacity:
  600 nodes
  ~2.4PB (80% used)
Data statistics:
  Source logs/day: 6TB
  Dimension data/day: 4TB
  Compression Factor ~5x (gzip)
Usage statistics:
  3200 jobs/day with 800K tasks (map-reduce tasks)/day
  55TB of compressed data scanned daily
  15TB of compressed output data written to HDFS
  150 active users within Facebook
Hive Progress and Roadmap
CREATE TABLE clicks(key STRING, value STRING)
  PARTITIONED BY (ds STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.TestSerDe'
  WITH SERDEPROPERTIES ('testserde.default.serialization.format'='\003')
  LOCATION '/hive/clicks';
Data Model
Metastore DB: Data Location, Bucketing Info, Partitioning Cols
HDFS layout:
  Tables: clicks -> /hive/clicks
  Logical Partitioning: /hive/clicks/ds=2008-03-25
  Hash Partitioning: /hive/clicks/ds=2008-03-25/0
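The directory layout in the data model can be read as a simple naming rule: the table location, then one `column=value` directory per partition, then an optional bucket entry. The sketch below only illustrates that rule with the paths shown on the slide; `partition_path` is a hypothetical helper, not part of Hive.

```python
def partition_path(table_location, partition_col, partition_value, bucket=None):
    """Build the HDFS path for a partition, and optionally for one bucket inside it."""
    path = f"{table_location}/{partition_col}={partition_value}"
    if bucket is not None:
        path = f"{path}/{bucket}"  # bucket entries are shown as plain numbers on the slide
    return path

print(partition_path("/hive/clicks", "ds", "2008-03-25"))
print(partition_path("/hive/clicks", "ds", "2008-03-25", bucket=0))
```

Because the partition value is encoded in the path, a query that filters on `ds` can skip every directory whose name does not match.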
HIVE: Components
[Component diagram. Components shown: Hive CLI and Web UI (Browsing, DDL, Queries), Thrift API, Parser, Planner, Optimizer, Execution, SerDe (Thrift, CSV, JSON..), MetaStore (DB), Map Reduce, HDFS]
Hive Query Language
SQL
  Subqueries in from clause
  Equi-joins
  Multi-table Insert
  Multi-group-by
Sampling
  SELECT s.key, count(1) FROM clicks TABLESAMPLE (BUCKET 1 OUT OF 32) s WHERE s.ds = '2009-04-22' GROUP BY s.key;
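One way to picture the sampling clause is as a hash filter: a row belongs to bucket `x` out of `y` when its key hashes to `x - 1` modulo `y`. The sketch below rests on that reading. `zlib.crc32` is only a stand-in, since the slides do not say which hash function Hive uses, and the key values are hypothetical.

```python
import zlib

def bucket_of(key, num_buckets):
    """Assign a key to a bucket; crc32 is a stand-in for Hive's own hash function."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets

def table_sample(keys, bucket, out_of):
    """Keep only keys that fall in the given 1-based bucket, as in BUCKET 1 OUT OF 32."""
    return [k for k in keys if bucket_of(k, out_of) == bucket - 1]

keys = [f"user{i}" for i in range(3200)]  # hypothetical key column
sample = table_sample(keys, bucket=1, out_of=32)
print(len(sample), "of", len(keys), "keys sampled")
```

Every key lands in exactly one bucket, so the 32 samples are disjoint and together cover the whole table.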
FROM pv_users
INSERT INTO TABLE pv_gender_sum
  SELECT gender, count(DISTINCT userid) GROUP BY gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
  SELECT age, count(DISTINCT userid) GROUP BY age
INSERT OVERWRITE LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
  SELECT age, count(DISTINCT userid) GROUP BY age;
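The appeal of the multi-table insert is that the source table is scanned once while several aggregates are built. A rough Python sketch of that idea follows. The row layout and the `multi_insert` name are assumptions for illustration, and the results are returned as dictionaries instead of being written to tables or directories.

```python
from collections import defaultdict

def multi_insert(pv_users):
    """One pass over pv_users feeding two count(DISTINCT userid) aggregates."""
    users_by_gender = defaultdict(set)
    users_by_age = defaultdict(set)
    for row in pv_users:  # single scan of the source
        users_by_gender[row["gender"]].add(row["userid"])
        users_by_age[row["age"]].add(row["userid"])
    gender_sum = {g: len(users) for g, users in users_by_gender.items()}
    age_sum = {a: len(users) for a, users in users_by_age.items()}
    return gender_sum, age_sum

# Hypothetical rows; user 111 appears twice and is counted once.
pv_users = [
    {"userid": 111, "gender": "female", "age": 25},
    {"userid": 111, "gender": "female", "age": 25},
    {"userid": 222, "gender": "male", "age": 32},
]
gender_sum, age_sum = multi_insert(pv_users)
print(gender_sum, age_sum)
```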
Hive Query Language (continued)
Extensibility
  Pluggable Map-reduce scripts
  Pluggable User Defined Functions
  Pluggable User Defined Types
    Complex object types: List of Maps
  Pluggable Data Formats
    Apache Log Format
Pluggable Map-Reduce Scripts

FROM (
  FROM pv_users
  MAP pv_users.userid, pv_users.date
  USING 'map_script' AS dt, uid
  CLUSTER BY dt
) map
INSERT INTO TABLE pv_users_reduced
  REDUCE map.dt, map.uid
  USING 'reduce_script' AS date, count;
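The slides do not show what `map_script` and `reduce_script` contain. As a hedged illustration, the sketch below assumes the usual streaming convention of tab-separated columns, one row per line, and assumes that `count` means the number of rows per date. Both functions take any iterable of lines, so wiring them to standard input and output is left out.

```python
from itertools import groupby

def map_script(lines):
    """Turn 'userid<TAB>date' rows into 'dt<TAB>uid' rows."""
    for line in lines:
        userid, date = line.rstrip("\n").split("\t")
        yield f"{date}\t{userid}"

def reduce_script(lines):
    """Input arrives grouped by dt (CLUSTER BY dt); emit 'date<TAB>count'."""
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for dt, group in groupby(parsed, key=lambda cols: cols[0]):
        yield f"{dt}\t{sum(1 for _ in group)}"

# Hypothetical input rows: userid, date.
rows = ["111\t2009-05-01", "222\t2009-05-01", "111\t2009-05-02"]
clustered = sorted(map_script(rows))  # stands in for CLUSTER BY dt
print(list(reduce_script(clustered)))
```

The reducer relies on rows with the same `dt` being adjacent, which is what the `CLUSTER BY dt` clause in the query provides.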
Map Reduce Example

Stage            Machine 1                            Machine 2
Input            <k1, v1> <k2, v2> <k3, v3>           <k4, v4> <k5, v5> <k6, v6>
Local Map        <nk1, nv1> <nk2, nv2> <nk3, nv3>     <nk2, nv4> <nk2, nv5> <nk1, nv6>
Global Shuffle   <nk1, nv1> <nk3, nv3> <nk1, nv6>     <nk2, nv4> <nk2, nv5> <nk2, nv2>
Local Sort       <nk1, nv1> <nk1, nv6> <nk3, nv3>     <nk2, nv4> <nk2, nv5> <nk2, nv2>
Local Reduce     <nk1, 2> <nk3, 1>                    <nk2, 3>
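The stages in this example can be replayed in a few lines of Python. The map outputs and the routing of keys to machines are copied from the diagram; the reduce step is assumed to count the values per key, which matches the numbers shown. This is a single-process sketch, not how Hadoop schedules the work.

```python
from itertools import groupby

# Map output per machine, as shown in the example.
MAP_OUTPUT = {
    1: [("nk1", "nv1"), ("nk2", "nv2"), ("nk3", "nv3")],
    2: [("nk2", "nv4"), ("nk2", "nv5"), ("nk1", "nv6")],
}
# Which machine reduces each key, as shown in the example.
ROUTE = {"nk1": 1, "nk3": 1, "nk2": 2}

def shuffle(map_output, route):
    """Global shuffle: send every pair to the machine that owns its key."""
    received = {machine: [] for machine in map_output}
    for pairs in map_output.values():
        for key, value in pairs:
            received[route[key]].append((key, value))
    return received

def sort_and_reduce(pairs):
    """Local sort, then local reduce that counts the values of each key."""
    ordered = sorted(pairs)
    return [(key, sum(1 for _ in group))
            for key, group in groupby(ordered, key=lambda kv: kv[0])]

received = shuffle(MAP_OUTPUT, ROUTE)
result = {machine: sort_and_reduce(pairs) for machine, pairs in received.items()}
print(result)
```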
Hive QL Join
INSERT INTO TABLE pv_users SELECT pv.pageid, u.age FROM page_view pv JOIN user u ON (pv.userid = u.userid);
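Hive QL Join in Map Reduce

page_view
  pageid  userid  time
  1       111     9:08:01
  2       111     9:08:13
  1       222     9:08:14

user
  userid  age  gender
  111     25   female
  222     32   male

Map (value is <table tag, column value>)
  from page_view: key 111 value <1,1>; key 111 value <1,2>; key 222 value <1,1>
  from user:      key 111 value <2,25>; key 222 value <2,32>

Shuffle Sort
  reducer 1: key 111 values <1,1> <1,2> <2,25>
  reducer 2: key 222 values <1,1> <2,32>

Reduce
  joins the tagged values of each key to produce the pv_users rows (pageid, age)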
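The tagging scheme in this diagram can be sketched as follows. The input rows are the ones shown for `page_view` and `user`; tag 1 marks a `page_view` value (the pageid) and tag 2 marks a `user` value (the age). The function names are hypothetical, and grouping in a dictionary stands in for the shuffle and sort.

```python
from collections import defaultdict

page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")]
user = [(111, 25, "female"), (222, 32, "male")]

def map_phase(page_view, user):
    """Emit (join key, <tag, value>) for both tables."""
    for pageid, userid, _time in page_view:
        yield userid, (1, pageid)
    for userid, age, _gender in user:
        yield userid, (2, age)

def reduce_phase(tagged):
    """Group by join key, then pair every page_view value with every user value."""
    groups = defaultdict(list)
    for key, value in tagged:
        groups[key].append(value)
    for key in sorted(groups):
        pageids = [v for tag, v in groups[key] if tag == 1]
        ages = [v for tag, v in groups[key] if tag == 2]
        for pageid in pageids:
            for age in ages:
                yield pageid, age

pv_users = list(reduce_phase(map_phase(page_view, user)))
print(pv_users)
```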
Join Optimizations
Map Joins
  User specified small tables stored in hash tables on the mapper, backed by jdbm
  No reducer needed
  INSERT INTO TABLE pv_users SELECT /*+ MAPJOIN(pv) */ pv.pageid, u.age FROM page_view pv JOIN user u ON (pv.userid = u.userid);
Future
  Exploit table/column statistics for deciding strategy
Hive QL Map Join

page_view
  pageid  userid  time
  1       111     9:08:01
  2       111     9:08:13
  1       222     9:08:14

Hash table (built from page_view, keyed by userid)
  key 111 value <1,2>
  key 222 value <1>

user
  userid  age  gender
  111     25   female
  222     32   male

pv_users
  pageid  age
  1       25
  2       25
  1       32
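A map join avoids the shuffle by loading one table into a hash table and probing it while the other table streams through the mapper. The sketch below follows the `MAPJOIN(pv)` hint and hashes `page_view`; it uses an in-memory dictionary where the slides mention a jdbm-backed table, and the function names are hypothetical.

```python
from collections import defaultdict

page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")]
user = [(111, 25, "female"), (222, 32, "male")]

def build_hash_table(page_view):
    """Hash the table named in the hint: userid -> list of pageids."""
    table = defaultdict(list)
    for pageid, userid, _time in page_view:
        table[userid].append(pageid)
    return table

def map_join(hash_table, user):
    """Stream the other table and probe the hash table; no reducer involved."""
    for userid, age, _gender in user:
        for pageid in hash_table.get(userid, []):
            yield pageid, age

hash_table = build_hash_table(page_view)
pv_users = list(map_join(hash_table, user))
print(pv_users)
```

The result matches the reduce-side join, which is the point: the same rows come out without a shuffle, provided the hashed table fits on each mapper.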
Hive QL Group By
SELECT pageid, age, count(1) FROM pv_users GROUP BY pageid, age;
Hive QL Group By in Map Reduce

pv_users (split 1)
  pageid  age
  1       25
  1       25

pv_users (split 2)
  pageid  age
  2       32
  1       25

Map
  split 1: key <1,25> value 2
  split 2: key <1,25> value 1; key <2,32> value 1

Shuffle Sort
  reducer 1: key <1,25> value 2; key <1,25> value 1
  reducer 2: key <2,32> value 1

Reduce
  sums the values of each key
Group by Optimizations
Map side partial aggregations
  Hash-based aggregates
  Serialized key/values in hash tables
  90% speed improvement on Query
    SELECT count(1) FROM t;
Load balancing for data skew
Optimizations being Worked On:
  Exploit pre-sorted data for distinct counts
  Exploit table/column statistics for deciding strategy
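Map-side partial aggregation can be pictured as each mapper keeping a small hash table of counts and sending one record per distinct key instead of one per row. The sketch below uses `collections.Counter` as that hash table and reuses the two `pv_users` splits from the group-by example; the function names are hypothetical.

```python
from collections import Counter

def map_partial_aggregate(rows):
    """Hash-based aggregate on the mapper: one (key, partial count) per distinct key."""
    return Counter(rows)

def reduce_merge(partials):
    """Reducer: add up the partial counts received for each key."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)

split_1 = [(1, 25), (1, 25)]  # (pageid, age) rows seen by mapper 1
split_2 = [(2, 32), (1, 25)]  # rows seen by mapper 2
partials = [map_partial_aggregate(split_1), map_partial_aggregate(split_2)]
print(reduce_merge(partials))
```

Mapper 1 ships a single record for its two identical rows, which is where the saving in shuffled data comes from.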
Columnar Storage
CREATE TABLE columnTable (key STRING, value STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.ColumnarSerDe'
  STORED AS RCFILE;
Saved 25% of space compared with SequenceFile
  Based on one of the largest tables (30 columns) inside Facebook
  Both are compressed with GzipCodec
Speed improvements in progress
  Need to propagate column-selection information to FileFormat
*Contribution from Yongqiang He (outside Facebook)
Speed Improvements over Time

Date       SVN Revision  Major Changes                 Query A  Query B  Query C
2/22/2009  746906        Before Lazy Deserialization   83 sec   98 sec   183 sec
2/23/2009  747293        Lazy Deserialization          40 sec   66 sec   185 sec
3/6/2009   751166        Map-side Aggregation          22 sec   67 sec   182 sec
4/29/2009  770074        Object Reuse                  21 sec   49 sec   130 sec
6/3/2009   781633        Map-side Join *               21 sec   48 sec   132 sec

QueryA: SELECT count(1) FROM t;
QueryB: SELECT concat(concat(concat(a,b),c),d) FROM t;
QueryC: SELECT * FROM t;

Time measured is map-side time only (to avoid unstable shuffling time at reducer side). It includes time for decompression and compression (both using GzipCodec).
* No performance benchmarks for Map-side Join yet.
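The timings above can be turned into percentage improvements with a small calculation. The formula here, time before divided by time after, minus one, is an assumption about how such percentages are derived; it is not stated in the slides.

```python
def speed_improvement_pct(before_sec, after_sec):
    """Throughput gain in percent when the same work takes after_sec instead of before_sec."""
    return (before_sec / after_sec - 1.0) * 100.0

# Query A across the lazy deserialization change (83 sec -> 40 sec).
query_a = speed_improvement_pct(83, 40)
# Query C across the object reuse change (182 sec -> 130 sec).
query_c = speed_improvement_pct(182, 130)
print(f"Query A: {query_a:.1f}%  Query C: {query_c:.1f}%")
```

Under that assumption the two rows work out to about 107.5% and 40%, close to the 108% and 40% quoted on the "Overcoming Java Overhead" slide.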
Overcoming Java Overhead
Reuse objects
  Use Writable instead of Java Primitives
  Reuse objects across all rows
  *40% speed improvement on Query C
Lazy deserialization
  Only deserialize the column when asked
  Very helpful for complex types (map/list/struct)
  *108% speed improvement on Query A
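Lazy deserialization is easiest to see with a row object that stores the raw column text and decodes a column only when it is first read. The `LazyRow` class below is a hypothetical sketch, not Hive's implementation; the `decoded` list exists only so the example can show which columns were actually decoded, and JSON stands in for a complex map type.

```python
import json

class LazyRow:
    """Row that decodes a column the first time it is requested and caches the result."""

    def __init__(self, raw, decoders, sep="\x01"):
        self._raw_fields = raw.split(sep)  # cheap split; no type decoding yet
        self._decoders = decoders
        self._cache = {}
        self.decoded = []  # instrumentation for the example only

    def get(self, index):
        if index not in self._cache:
            self._cache[index] = self._decoders[index](self._raw_fields[index])
            self.decoded.append(index)
        return self._cache[index]

row = LazyRow('42\x01{"a": 1, "b": 2}', decoders=[int, json.loads])
first = row.get(0)
again = row.get(0)  # served from the cache, no second decode
print(first, row.decoded)  # the map column has not been decoded at this point
```

A query such as `SELECT count(1) FROM t` never calls `get` at all, so under this scheme it pays no decoding cost for any column.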
Generic UDF and UDAF
Let UDF and UDAF accept complex-type parameters
Integrate UDF and UDAF with Writables
// intWritable is assumed to be a member field that is reused across rows
public IntWritable evaluate(IntWritable a, IntWritable b) {
  intWritable.set((int)(a.get() + b.get()));
  return intWritable;
}
HQL Optimizations
Predicate Pushdown
Merging n-way join
Column Pruning
Open Source Community
21 contributors and growing
6 contributors within Facebook
Contributors from:
  Academia
  Other web companies
  Etc.
7 committers
  1 external to Facebook and looking to add more here
50 jiras fixed in last month
218 jiras still open
125 mails in last month on hive-user@
600 mails in last month on hive-dev@
Various companies/universities
  Adknowledge, Admob
  Berkeley, Chinese Academy of Science
Demonstration in VLDB2009
Deployment Options
EC2
  http://wiki.apache.org/hadoop/Hive/HiveAws/HivingS3nRemotely
Cloudera Virtual Machine
  http://www.cloudera.com/hadoop-training-hive-tutorial
Your own cluster
  http://wiki.apache.org/hadoop/Hive/GettingStarted
Hive can directly consume data on hadoop
  CREATE EXTERNAL TABLE mytable (key STRING, value STRING) LOCATION '/user/abc/mytable';
Future Work
Benchmark & Performance
Integration with BI tools (through JDBC/ODBC)
Indexing
More on Hive Roadmap
  http://wiki.apache.org/hadoop/Hive/Roadmap
Machine Learning Integration
Real-time Streaming
Information
Available as a sub project in Hadoop
  http://wiki.apache.org/hadoop/Hive (wiki)
  http://hadoop.apache.org/hive (home page)
  http://svn.apache.org/repos/asf/hadoop/hive (SVN repo)
  ##hive (IRC)
  Works with hadoop-0.17, 0.18, 0.19
Release 0.3 is out and more are coming
Mailing Lists:
  hive-{user,dev,commits}@hadoop.apache.org
Contributors
Aaron Newton, Ashish Thusoo, David Phillips, Dhruba Borthakur, Edward Capriolo, Eric Hwang, Hao Liu, He Yongqiang, Jeff Hammerbacher, Johan Oskarsson, Josh Ferguson, Joydeep Sen Sarma, Kim P., Michi Mutsuzaki, Min Zhou, Namit Jain, Neil Conway, Pete Wyckoff, Prasad Chakka, Raghotham Murthy, Richard Lee, Shyam Sundar Sarkar, Suresh Antony, Venky Iyer, Zheng Shao
Questions