Agenda
Introduction
Facebook Usage
Hive Progress and Roadmap
Open Source Community
Introduction
Solution: HIVE
Facebook
What is HIVE?
A system for managing and querying structured data built on top of Hadoop
Map-Reduce for execution
HDFS for storage
Metadata on raw files
Simplifying Hadoop
RDBMS> select key, count(1) from kv1 where key > 100 group by key;
vs.
hive> select key, count(1) from kv1 where key > 100 group by key;
Facebook Usage
[Figure: data flow among Web Servers, Scribe Servers, Filers, Oracle RAC, and Federated MySQL]
Complex measures of user engagement
Ad hoc Analysis
  E.g. how many group admins, broken down by state/country
Data Mining (assembling training data)
  E.g. user engagement as a function of user attributes
Spam Detection
  Anomalous patterns for Site Integrity
  Application API usage patterns
Ad Optimization
Data statistics:
Source logs/day: 6TB
Dimension data/day: 4TB
Compression factor: ~5x (gzip)
Usage statistics:
3200 jobs/day with 800K map-reduce tasks/day
55TB of compressed data scanned daily
15TB of compressed output data written to HDFS
150 active users within Facebook
CREATE TABLE clicks(key STRING, value STRING)
PARTITIONED BY (ds STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.TestSerDe'
WITH SERDEPROPERTIES ('testserde.default.serialization.format'='\003')
LOCATION '/hive/clicks';
Data Model
Metastore DB: Data Location, Bucketing Info, Partitioning Cols
Tables: clicks
HDFS: /hive/clicks/ds=2008-03-25, /hive/clicks/ds=2008-03-25/0
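The directory layout in this data model can be sketched as a small path-building function. This is a minimal Python illustration, assuming a partition maps to a `ds=<value>` subdirectory of the table location and a bucket maps to a numbered entry inside it, as the example paths suggest. `partition_path` and `bucket_path` are hypothetical helper names, not Hive APIs.

```python
import posixpath

def partition_path(table_location, partition_col, partition_value):
    """Directory holding one partition of a table (illustrative only)."""
    return posixpath.join(table_location, f"{partition_col}={partition_value}")

def bucket_path(table_location, partition_col, partition_value, bucket):
    """Entry for one bucket inside a partition directory (illustrative only)."""
    return posixpath.join(
        partition_path(table_location, partition_col, partition_value),
        str(bucket),
    )

print(partition_path("/hive/clicks", "ds", "2008-03-25"))
print(bucket_path("/hive/clicks", "ds", "2008-03-25", 0))
```

A query that filters on `ds` only needs to read the matching partition directory, which is the point of keeping the partitioning columns in the metastore.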
HIVE: Components
[Diagram components: Hive CLI (Browsing, DDL, Queries), Web UI, Thrift API, MetaStore DB, Map Reduce, HDFS]
SQL
Subqueries in FROM clause
Equi-joins
Multi-table Insert
Multi-group-by
Sampling
SELECT s.key, count(1)
FROM clicks TABLESAMPLE (BUCKET 1 OUT OF 32) s
WHERE s.ds = '2009-04-22'
GROUP BY s.key
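The idea behind bucket sampling can be sketched in a few lines of Python. This is an illustration only: it assumes rows are assigned to buckets by hashing the key modulo the number of buckets, and it uses `zlib.crc32` as a stand-in hash, which is not the hash Hive uses. `bucket_of` and `table_sample` are hypothetical names.

```python
import zlib
from collections import Counter

def bucket_of(key, num_buckets):
    """1-based bucket number; crc32 is a stand-in for the real hash."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets + 1

def table_sample(rows, bucket, num_buckets):
    """Keep only the rows whose key falls in the requested bucket."""
    return [row for row in rows if bucket_of(row["key"], num_buckets) == bucket]

# Synthetic rows standing in for one day's partition of the clicks table.
rows = [{"key": f"user{i}", "ds": "2009-04-22"} for i in range(3200)]

sample = table_sample(rows, bucket=1, num_buckets=32)
counts = Counter(row["key"] for row in sample)  # count(1) ... GROUP BY key
print(len(sample), "of", len(rows), "rows sampled")
```

Because every row lands in exactly one bucket, the 32 samples are disjoint and together cover the whole table.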
FROM pv_users
INSERT INTO TABLE pv_gender_sum
  SELECT gender, count(DISTINCT userid) GROUP BY gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
  SELECT age, count(DISTINCT userid) GROUP BY age
INSERT OVERWRITE LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
  SELECT age, count(DISTINCT userid) GROUP BY age;
Extensibility
Pluggable Map-reduce scripts
Pluggable User Defined Functions
Pluggable User Defined Types
Complex object types: List of Maps
[Figure: map-reduce data flow on Machine 2. Local Map turns <k4, v4> <k5, v5> <k6, v6> into <nk2, nv4> <nk2, nv5> <nk1, nv6>, followed by Global Shuffle, Local Sort, and Local Reduce, which emits <nk2, 3>]
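The map, shuffle, sort, and reduce stages can be sketched in a single Python process. This is a simplified illustration, not how Hadoop is implemented: the "machines" are plain lists, the shuffle is a sort over one in-memory list, and `run_map_reduce` is a hypothetical name.

```python
from itertools import groupby
from operator import itemgetter

def run_map_reduce(machines, map_fn, reduce_fn):
    # Local map: each machine's records are transformed independently.
    mapped = [pair
              for records in machines
              for record in records
              for pair in map_fn(record)]
    # Shuffle and sort: bring equal keys together, ordered by key.
    mapped.sort(key=itemgetter(0))
    # Reduce: one call per distinct key, with all of that key's values.
    return [reduce_fn(key, [value for _, value in group])
            for key, group in groupby(mapped, key=itemgetter(0))]

machines = [["a", "b", "a"], ["b", "b", "c"]]
result = run_map_reduce(
    machines,
    map_fn=lambda word: [(word, 1)],
    reduce_fn=lambda key, values: (key, sum(values)),
)
print(result)
```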
Hive QL Join
INSERT INTO TABLE pv_users SELECT pv.pageid, u.age FROM page_view pv JOIN user u ON (pv.userid = u.userid);
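[Figure: join executed as map-reduce.
page_view rows (pageid, userid, time): (1, 111, 9:08:01), (2, 111, 9:08:13), (1, 222, 9:08:14)
user rows (userid, age): (111, 25), (222, 32)
Map emits userid as the key and a table-tagged value, e.g. key 111 value <1,1> for page_view, and key 111 value <2,25>, key 222 value <2,32> for user
Shuffle Sort groups the values by key
Reduce combines the tagged values for each key]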
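The join flow in this example can be sketched in Python. This is a minimal illustration of a reduce-side join using the example rows from the slide, assuming tag 1 marks page_view values and tag 2 marks user values; the function names are hypothetical and the shuffle is simulated with a dictionary.

```python
from collections import defaultdict

page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")]
user = [(111, 25), (222, 32)]

def map_page_view(row):
    pageid, userid, _time = row
    return userid, (1, pageid)  # tag 1 marks a page_view value

def map_user(row):
    userid, age = row
    return userid, (2, age)     # tag 2 marks a user value

def reduce_join(values):
    """Pair every page_view value with every user value for one key."""
    pageids = [value for tag, value in values if tag == 1]
    ages = [value for tag, value in values if tag == 2]
    return [(pageid, age) for pageid in pageids for age in ages]

def join(page_view_rows, user_rows):
    # Shuffle: collect the tagged values emitted by both mappers under their key.
    shuffled = defaultdict(list)
    for key, value in map(map_page_view, page_view_rows):
        shuffled[key].append(value)
    for key, value in map(map_user, user_rows):
        shuffled[key].append(value)
    # Sort by key, then reduce each key's values into joined rows.
    return [row for key in sorted(shuffled) for row in reduce_join(shuffled[key])]

pv_users = join(page_view, user)
print(pv_users)
```

The tag is what lets a single reducer tell the two input tables apart after their rows have been mixed under the same key.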
Join Optimizations
Map Joins
User-specified small tables are stored in hash tables on the mapper, backed by jdbm
No reducer needed
INSERT INTO TABLE pv_users SELECT /*+ MAPJOIN(pv) */ pv.pageid, u.age FROM page_view pv JOIN user u ON (pv.userid = u.userid);
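A map join can be sketched as a hash-table lookup performed entirely in the mapper. This Python illustration assumes the table named in the hint (`pv`) is the one loaded into memory and that `user` rows are streamed past it; a plain dictionary stands in for the jdbm-backed hash table, and `map_join` is a hypothetical name.

```python
from collections import defaultdict

def map_join(hashed_rows, streamed_rows):
    # Build phase: load the hinted table into memory, keyed on userid.
    table = defaultdict(list)
    for pageid, userid, _time in hashed_rows:
        table[userid].append(pageid)
    # Probe phase: each streamed row looks up its matches directly,
    # so no shuffle, sort, or reducer is involved.
    output = []
    for userid, age in streamed_rows:
        for pageid in table.get(userid, []):
            output.append((pageid, age))
    return output

page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")]
user = [(111, 25), (222, 32)]

pv_users = map_join(page_view, user)
print(pv_users)
```

The trade-off is memory: the hashed table must fit on every mapper, which is why the choice of table is left to the user.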
Future
Exploit table/column statistics for deciding strategy
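[Figure: map join example. Hash table with keys 111, 222 and values <1,2>, <2>; page_view rows (pageid, userid, time): (1, 111, 9:08:01), (2, 111, 9:08:13), (1, 222, 9:08:14); user rows (userid, age): (111, 25), (222, 32); output table pv_users (pageid, age)]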
Hive QL Group By
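[Figure: group-by executed as map-reduce.
pv_users rows (pageid, age) are split across two mappers: (1, 25), (1, 25) and (2, 32), (1, 25)
Map emits key <1,25> value 2 from the first mapper, and key <1,25> value 1, key <2,32> value 1 from the second
Shuffle Sort groups the values by key
Reduce combines the values for each key]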
Group by Optimizations
Map side partial aggregations
Hash-based aggregates
Serialized key/values in hash tables
90% speed improvement on the query SELECT count(1) FROM t;
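Map-side partial aggregation can be sketched in Python with the group-by example rows. This is a simplified illustration, assuming each mapper keeps a hash table of partial counts and emits one pair per distinct key instead of one pair per row; `map_partial_aggregate` and `reduce_merge` are hypothetical names.

```python
from collections import Counter

def map_partial_aggregate(rows):
    """Mapper: count in a local hash table before emitting anything."""
    partial = Counter()
    for pageid, age in rows:
        partial[(pageid, age)] += 1
    return list(partial.items())

def reduce_merge(partials):
    """Reducer: add up the partial counts received for each key."""
    total = Counter()
    for key, count in partials:
        total[key] += count
    return dict(total)

# The pv_users rows seen by each of two mappers.
mapper_inputs = [[(1, 25), (1, 25)], [(2, 32), (1, 25)]]

emitted = [pair
           for rows in mapper_inputs
           for pair in map_partial_aggregate(rows)]
result = reduce_merge(emitted)
print(len(emitted), "pairs shuffled for", sum(map(len, mapper_inputs)), "input rows")
print(result)
```

Fewer pairs cross the network than there are input rows, and the saving grows with the number of repeated keys per mapper.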
Columnar Storage
CREATE TABLE columnTable (key STRING, value STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.ColumnarSerDe'
STORED AS RCFILE;
QueryA: SELECT count(1) FROM t;
QueryB: SELECT concat(concat(concat(a,b),c),d) FROM t;
QueryC: SELECT * FROM t;
Time measured is map-side time only (to avoid unstable shuffling time on the reducer side). It includes time for decompression and compression (both using GzipCodec).
* No performance benchmarks for Map-side Join yet.
Reuse objects
Use Writable instead of Java primitives
Reuse objects across all rows
*40% speed improvement on Query C
Lazy deserialization
Only deserialize the column when asked
Very helpful for complex types (map/list/struct)
*108% speed improvement on Query A
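Lazy deserialization can be sketched as a row object that parses a column only on first access. This Python illustration is an assumption-laden sketch, not Hive's SerDe code: `LazyRow` is a hypothetical class, the `\x01` field delimiter is assumed, and the `deserialized` counter exists only to make the effect visible.

```python
class LazyRow:
    def __init__(self, raw, delimiter="\x01"):
        # Cheap split into raw strings; no type conversion happens yet.
        self._fields = raw.split(delimiter)
        self._cache = {}
        self.deserialized = 0  # number of columns actually parsed

    def get(self, index, parse):
        """Parse column `index` on first access and cache the result."""
        if index not in self._cache:
            self._cache[index] = parse(self._fields[index])
            self.deserialized += 1
        return self._cache[index]

def parse_map(text):
    """Parse 'a=1,b=2' into a dict; stands in for an expensive complex type."""
    return {k: int(v) for k, v in (item.split("=") for item in text.split(","))}

row = LazyRow("42\x01a=1,b=2\x013.5")
row_count = 1                # a count(1)-style query touches no column
untouched = row.deserialized
key = row.get(0, int)        # only column 0 is parsed
row.get(0, int)              # second access is served from the cache
print(untouched, key, row.deserialized)
```

A query that never reads the map column never pays for parsing it, which is where complex types benefit most.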
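Let UDF and UDAF accept complex-type parameters
Integrate UDF and UDAF with Writables

// One IntWritable is reused across rows instead of allocating a result per call.
private final IntWritable intWritable = new IntWritable();

public IntWritable evaluate(IntWritable a, IntWritable b) {
  intWritable.set(a.get() + b.get());
  return intWritable;
}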
HQL Optimizations
Contributors from:
Academia
Other web companies
etc.
7 committers
1 external to Facebook and looking to add more here
50 JIRAs fixed in the last month
218 JIRAs still open
125 mails in the last month on hive-user@
600 mails in the last month on hive-dev@
Various companies/universities
Adknowledge, Admob, Berkeley, Chinese Academy of Sciences
Demonstration at VLDB 2009
Deployment Options
EC2
http://wiki.apache.org/hadoop/Hive/HiveAws/HivingS3nRemotely
Future Work
Benchmark & Performance
Integration with BI tools (through JDBC/ODBC)
Indexing
More on Hive Roadmap
http://wiki.apache.org/hadoop/Hive/Roadmap
Information
Available as a sub project in Hadoop
http://wiki.apache.org/hadoop/Hive (wiki)
http://hadoop.apache.org/hive (home page)
http://svn.apache.org/repos/asf/hadoop/hive (SVN repo)
##hive (IRC)
Works with hadoop-0.17, 0.18, 0.19
Contributors
Aaron Newton, Ashish Thusoo, David Phillips, Dhruba Borthakur, Edward Capriolo, Eric Hwang, Hao Liu, He Yongqiang, Jeff Hammerbacher, Johan Oskarsson, Josh Ferguson, Joydeep Sen Sarma, Kim P., Michi Mutsuzaki, Min Zhou, Namit Jain, Neil Conway, Pete Wyckoff, Prasad Chakka, Raghotham Murthy, Richard Lee, Shyam Sundar Sarkar, Suresh Antony, Venky Iyer, Zheng Shao
Questions