www.persistentsys.com
Summary
This presentation describes a performance experiment on Hive that uses indexes to accelerate query execution. The slides cover:
- Indexes
- A specific set of GROUP BY queries
- The rewrite technique
- The performance experiment and its results
Hive usage
- HDFS spreads the data across different locations (data nodes).
- Data is dumped and loaded into HDFS as is.
- There is only one view of the data: the original data structure and layout.
- Data is typically append-only.
- Processing times are dominated by full data scan times.
Can the data access times be better?
Hive usage
What can be done to speed up queries? Cut down the data I/O. Less data means faster processing.
Different ways to get performance:
- Columnar storage
- Data partitioning
- Indexing (a different view of the same data)
2010 Persistent Systems Ltd www.persistentsys.com
Hive Indexing
- Provides a key-based view of the data
- Key data is duplicated
- Storage layout favors search and lookup performance
- Provides better data access for certain operations
- A cheaper alternative to full data scans!
- How cheap? An order of magnitude better in certain cases!
l_shipdate | HDFS file | _offsets
1992-01-08 | hdfs://hadoop1:54310/user//lineitem.tbl | ["662368"]
1992-01-16 | hdfs://hadoop1:54310/user//lineitem.tbl | ["143623","390763","637910"]
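The sample index rows above pair each key value with a file and a list of row offsets. A minimal sketch of how such a key-to-offsets mapping could be built for one file is shown below. It is an illustration only, not the Hive index builder: the two-column data, the `iter_rows` and `build_compact_index` helpers, and the byte offsets are all hypothetical.

```python
from collections import defaultdict


def iter_rows(text):
    """Yield (byte_offset, line) for every line in a text blob."""
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line.encode("utf-8"))


def build_compact_index(rows, key_of):
    """Map each key to the list of offsets of the rows that hold it."""
    index = defaultdict(list)
    for offset, line in rows:
        index[key_of(line)].append(offset)
    return dict(index)


# Hypothetical miniature table: "orderkey|shipdate", one row per line.
data = "1|1992-01-08\n2|1992-01-16\n3|1992-01-16\n4|1992-01-16\n"

index = build_compact_index(iter_rows(data), key_of=lambda line: line.split("|")[1])
print(index)
```

Each key appears once in the index regardless of how many base rows share it, which is why the index is usually much smaller than the table it describes.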
- SELECT (mapping, projection, association: given a key, fetch the value)
- WHERE (filters on keys)
- GROUP BY (grouping on keys)
- JOIN (join key as index key)
Indexes have high potential for accelerating a wide range of queries.
Hive Index
- Index as Reference
- Index as Data
- This demonstration uses the Index as Data technique to show an order-of-magnitude performance gain.
- It uses a query rewrite technique to transform queries on the base table into queries on the index table.
- Applicability is currently limited (e.g. the demo is based on GROUP BY), but the technique itself has wide potential.
- It is also a very quick way to demonstrate the importance of indexes for performance (no deep optimizer or execution engine modifications).
Query rewritten to use indexes, but still a valid query (nothing special in it!)
SELECT l_shipdate, size(`_offsets`)
FROM __lineitem_shipdate_idx__
WHERE YEAR(l_shipdate) >= 1992
  AND YEAR(l_shipdate) <= 1996;
SELECT Year, SUM(cnt) AS Total
FROM (SELECT YEAR(l_shipdate) AS Year,
             size(`_offsets`) AS cnt
      FROM __lineitem_shipdate_idx__) AS t
GROUP BY Year;
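The rewritten query works because the number of offsets stored for a key equals the number of base-table rows with that key, so a per-year count can be summed from the index alone. The sketch below mimics that idea in Python under stated assumptions: the first two entries reuse the sample index rows shown earlier, the 1993 entry is invented for illustration, and `shipments_per_year` is a hypothetical helper, not part of Hive.

```python
from collections import Counter

# Index contents: ship date -> row offsets in the base table.
# The 1993 entry is hypothetical, added only to give a second group.
index = {
    "1992-01-08": [662368],
    "1992-01-16": [143623, 390763, 637910],
    "1993-03-02": [1200, 88231],
}


def shipments_per_year(index):
    """Aggregate row counts per year without touching the base table."""
    totals = Counter()
    for shipdate, offsets in index.items():
        year = int(shipdate[:4])       # plays the role of YEAR(l_shipdate)
        totals[year] += len(offsets)   # plays the role of size(`_offsets`)
    return dict(totals)


print(shipments_per_year(index))  # {1992: 4, 1993: 2}
```

The loop visits one entry per distinct key rather than one per base row, which is where the reduction in data I/O comes from.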
Histogram Query
SELECT YEAR(l_shipdate) AS Year,
       MONTH(l_shipdate) AS Month,
       COUNT(1) AS Shipments
FROM lineitem
GROUP BY YEAR(l_shipdate), MONTH(l_shipdate);
       COUNT(1) AS Shipments
FROM lineitem
WHERE YEAR(l_shipdate) = 1998
GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y2
ON y1.Month = y2.Month;
Performance tests
Hardware and software configuration:
- 2 server-class machines (each box: CentOS 5.x Linux, 5 SAS disks in RAID5, 16GB RAM)
- 2-node Hadoop cluster (0.20.2), untuned and unoptimized; data not partitioned or clustered; Hive tables stored in row-store format; HDFS replication factor: 2
- Hive development branch (~0.5)
- Sun JDK 1.6 (server mode JVM, JVM_HEAP_SIZE: 4GB RAM)
- Queries on TPC-H data (lineitem table: 70% of TPC-H data size, e.g. TPC-H 30GB data: 21GB lineitem, ~180 million tuples)
Data size | Series 1 | Series 2
1M | 24.161 | 21.268
1G | 76.79 | 27.292
10G | 506.005 | 35.502
30G | 1551.555 | 86.133
Data size | Series 1 | Series 2
1M | 73.66 | 69.393
1G | 130.587 | 75.493
10G | 764.619 | 92.867
30G | 2146.423 | 190.619
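The two sets of results can be turned into ratios to see how the gap grows with data size. The chart labels did not survive, so the sketch below rests on an assumption: for each data size, the first number is the time for the original query and the second is the time for the index-based rewrite, in the same unit. The names `first_results`, `second_results`, and `ratios` are illustrative.

```python
# (data size, first value, second value), copied from the two result sets.
# Assumption: first value = original query, second value = rewritten query.
first_results = [
    ("1M", 24.161, 21.268),
    ("1G", 76.79, 27.292),
    ("10G", 506.005, 35.502),
    ("30G", 1551.555, 86.133),
]
second_results = [
    ("1M", 73.66, 69.393),
    ("1G", 130.587, 75.493),
    ("10G", 764.619, 92.867),
    ("30G", 2146.423, 190.619),
]


def ratios(results):
    """Return first/second ratio per data size, rounded to one decimal."""
    return {size: round(a / b, 1) for size, a, b in results}


print(ratios(first_results))
print(ratios(second_results))
```

Under that assumption the ratio is close to 1 at 1M and exceeds 10 at 30G in both result sets, which is consistent with the order-of-magnitude claim made earlier in the deck.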
Near future
- More rewrites
- Partitioning index data per key
- Run-time operators for index usage (lookup, join, filter, etc.), since rewrites are only a partial solution
- Optimizer support for index operators
- Cost-based optimizer to choose between index and non-index plans
Index Design
[Diagram: Index Builder, Hadoop MR, HDFS]
Hive Compiler
[Diagram: Rule Engine, Query Tree, Rewrite Rule (Rewrite Trigger Condition, Rewrite Action)]
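The diagram labels suggest that a rewrite rule pairs a trigger condition with a rewrite action, applied by a rule engine to a query tree. A minimal sketch of that shape follows. Everything here is hypothetical and greatly simplified: the flat `Query` record stands in for a real query tree, and `INDEXES`, `RewriteRule`, and `apply_rules` are illustrative names, not Hive classes.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Query:
    """A drastically simplified stand-in for a query tree."""
    table: str
    group_by: Tuple[str, ...]
    aggregate: str


@dataclass(frozen=True)
class RewriteRule:
    name: str
    condition: Callable[[Query], bool]   # rewrite trigger condition
    action: Callable[[Query], Query]     # rewrite action


# Hypothetical catalog: (table, column) -> index table name.
INDEXES = {("lineitem", "l_shipdate"): "__lineitem_shipdate_idx__"}


def count_on_indexed_key(q: Query) -> bool:
    """Trigger: COUNT(1) grouped by a single column that has an index."""
    return (
        q.aggregate == "COUNT(1)"
        and len(q.group_by) == 1
        and (q.table, q.group_by[0]) in INDEXES
    )


def use_index_table(q: Query) -> Query:
    """Action: read the index table and sum the offset-list sizes."""
    return Query(
        table=INDEXES[(q.table, q.group_by[0])],
        group_by=q.group_by,
        aggregate="SUM(size(`_offsets`))",
    )


def apply_rules(query: Query, rules) -> Query:
    """Apply the first rule whose condition holds; otherwise keep the query."""
    for rule in rules:
        if rule.condition(query):
            return rule.action(query)
    return query


rules = [RewriteRule("group-by-count-to-index", count_on_indexed_key, use_index_table)]
original = Query("lineitem", ("l_shipdate",), "COUNT(1)")
print(apply_rules(original, rules))
```

Keeping the condition separate from the action means a query that does not match any rule passes through unchanged, so the rewritten output is always still an ordinary query.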
Learning Hive
- The Hive compiler is not driven by syntax-directed translation.
- It is tree-visitor based, with data structures separated from compiler logic.
- The tree is immutable (harder to change, harder to rewrite).
- Query semantic information is maintained separately from the query lexical/parse tree, in different data structures. These are loosely bound in a Query Block data structure, which is itself loosely bound to the parse tree, yet there is no larger data flow graph off which everything is hung. This makes it very difficult to rewrite queries.
- Setting up the development environment is tedious (the build system relies heavily on an internet connection, which is troublesome behind restrictive firewalls).
- The community is very active: dependent JIRAs are fast-moving targets, and development-wise we need to keep up with them actively (e.g. when branching, refresh from trunk frequently).
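To illustrate why an immutable tree is harder to rewrite, the sketch below changes one leaf of a small frozen tree. Because no node can be modified in place, every node on the path from the root to the changed leaf has to be rebuilt, while untouched subtrees can be shared. The `Node` class and `rewrite` function are hypothetical and are not Hive's parse-tree API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    label: str
    children: Tuple["Node", ...] = ()


def rewrite(node: Node, old_label: str, new_label: str) -> Node:
    """Return a tree where old_label is replaced, reusing unchanged subtrees."""
    new_children = tuple(rewrite(c, old_label, new_label) for c in node.children)
    label = new_label if node.label == old_label else node.label
    unchanged = label == node.label and all(
        a is b for a, b in zip(new_children, node.children)
    )
    if unchanged:
        return node                     # nothing below changed: share the node
    return Node(label, new_children)    # otherwise rebuild this node


tree = Node("SELECT", (
    Node("FROM", (Node("lineitem"),)),
    Node("GROUP BY", (Node("l_shipdate"),)),
))
new_tree = rewrite(tree, "lineitem", "__lineitem_shipdate_idx__")
print(new_tree)
```

Any side structures that point at the old nodes (such as separately kept semantic information) would still refer to the original tree after such a rewrite, which is one way the loose coupling described above makes rewrites difficult.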
Thank You!
prafulla_tekawade at persistent dot co dot in
nikhil_deshpande at persistent dot co dot in