
Indexed Hive

A quick demonstration of Hive performance acceleration using indexes


By: Prafulla Tekawade, Nikhil Deshpande

www.persistentsys.com

Summary

This presentation describes a performance experiment on Hive that uses indexes to accelerate query execution. The slides cover:
- Indexes
- A specific set of GROUP BY queries
- The query rewrite technique
- The performance experiment and its results

2010 Persistent Systems Ltd


Hive usage

- HDFS spreads and scatters the data across different locations (data nodes).
- Data is dumped and loaded into HDFS as is.
- There is only one view of the data: the original data structure and layout.
- Data is typically append-only.
- Processing times are dominated by full data scan times.

Can data access times be improved?


Hive usage

What can be done to speed up queries?
- Cut down the data I/O. Less data means faster processing.

Different ways to gain performance:
- Columnar storage
- Data partitioning
- Indexing (a different view of the same data)

Hive Indexing

- Provides a key-based view of the data
- Key data is duplicated
- Storage layout favors search and lookup performance
- Provides better data access for certain operations
- A cheaper alternative to full data scans!
- How cheap? An order of magnitude better in certain cases!


What does the index look like?


An index is a table with 3 columns:

hive> describe default__tpch1m_lineitem_tpch1m_lineitem_shipdate_idx__;
OK
l_shipdate     string           (key)
_bucketname    string           (reference to values)
_offsets       array<string>    (reference to values)

Data in the index looks like:

hive> select * from default__tpch1m_lineitem_tpch1m_lineitem_shipdate_idx__ limit 2;
OK
1992-01-08    hdfs://hadoop1:54310/user//lineitem.tbl    ["662368"]
1992-01-16    hdfs://hadoop1:54310/user//lineitem.tbl    ["143623","390763","637910"]
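
The layout above (key, file name, list of row offsets) can be mimicked in a few lines. The following is a minimal Python sketch, not Hive's index builder: it assumes a small in-memory text file with one row per line, treats the first `|`-separated field as the key, and records the byte offset at which each row starts. The function name `build_index`, the sample rows, and the file name are hypothetical.

```python
from collections import defaultdict

def build_index(data: bytes, bucketname: str, key_field: int = 0, sep: bytes = b"|"):
    """Return index rows of (key, bucketname, offsets), sorted by key.

    Offsets are the byte positions of the rows holding that key, kept as
    strings to mirror the array<string> column of the index table.
    """
    offsets_by_key = defaultdict(list)
    pos = 0
    for line in data.splitlines(keepends=True):
        key = line.rstrip(b"\n").split(sep)[key_field].decode()
        offsets_by_key[key].append(str(pos))
        pos += len(line)  # next row starts right after this one
    return [(key, bucketname, offsets_by_key[key]) for key in sorted(offsets_by_key)]

# Hypothetical sample data: a ship date followed by one other column.
sample = b"1992-01-16|a\n1992-01-08|b\n1992-01-16|c\n"
index = build_index(sample, "hdfs://example/lineitem.tbl")
for row in index:
    print(row)
```

Each key appears once, and the length of its offsets list equals the number of base rows with that key, which is the property the rewrites in the following slides rely on.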


Hive index in HQL

- SELECT (mapping, projection, association: given a key, fetch the value)
- WHERE (filters on keys)
- GROUP BY (grouping on keys)
- JOIN (join key as index key)

Indexes have high potential for accelerating a wide range of queries.


Hive Index
- Index as Reference
- Index as Data
- This demonstration uses the Index as Data technique to show an order-of-magnitude performance gain!
- It uses a Query Rewrite technique to transform queries on the base table into queries on the index table.
- Applicability is currently limited (e.g. the demo is based on GROUP BY), but the technique itself has wide potential.
- It is also a very quick way to demonstrate the importance of indexes for performance (no deep optimizer or execution engine modifications).

Indexes and Query Rewrites

Demo targeting:
- GROUP BY, aggregation
- Index as Data
- Group By Key = Index Key
- Query rewritten to use indexes, but still a valid query (nothing special in it!)


Query Rewrites: simple GROUP BY

Original query:
SELECT DISTINCT l_shipdate FROM lineitem;

Rewritten to use the index:
SELECT l_shipdate FROM __lineitem_shipdate_idx__;


Query Rewrites: simple aggregation

Original query:
SELECT l_shipdate, COUNT(1) FROM lineitem GROUP BY l_shipdate;

Rewritten to use the index:
SELECT l_shipdate, size(`_offsets`) FROM __lineitem_shipdate_idx__;


Query Rewrites: GROUP BY + WHERE

Original query:
SELECT l_shipdate, COUNT(1)
FROM lineitem
WHERE YEAR(l_shipdate) >= 1992 AND YEAR(l_shipdate) <= 1996
GROUP BY l_shipdate;

Rewritten to use the index:
SELECT l_shipdate, size(`_offsets`)
FROM __lineitem_shipdate_idx__
WHERE YEAR(l_shipdate) >= 1992 AND YEAR(l_shipdate) <= 1996;

Query Rewrites: GROUP BY on func(key)

Original query:
SELECT YEAR(l_shipdate) AS Year, COUNT(1) AS Total
FROM lineitem
GROUP BY YEAR(l_shipdate);

Rewritten to use the index:
SELECT Year, SUM(cnt) AS Total
FROM (SELECT YEAR(l_shipdate) AS Year, size(`_offsets`) AS cnt
      FROM __lineitem_shipdate_idx__) AS t
GROUP BY Year;
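
Why this rewrite gives the same answer: each index row carries one offset per base-table row with that key, so summing the offset counts per year equals counting base rows per year. Below is a minimal Python sketch of that equivalence. It uses hypothetical in-memory data in place of Hive tables and assumes one index row per key; the offset values are made up, since only their count matters.

```python
from collections import Counter

# Hypothetical base table: just the l_shipdate column.
lineitem = ["1992-01-08", "1992-01-16", "1992-01-16", "1993-03-02", "1993-07-19"]

# Hypothetical index rows: (l_shipdate, _offsets).
index_rows = [
    ("1992-01-08", ["0"]),
    ("1992-01-16", ["40", "80"]),
    ("1993-03-02", ["120"]),
    ("1993-07-19", ["160"]),
]

def year(date: str) -> int:
    """Extract the year from a YYYY-MM-DD string."""
    return int(date[:4])

# Original query: COUNT(1) grouped by YEAR(l_shipdate) over the base table.
totals_from_base = Counter(year(d) for d in lineitem)

# Rewritten query: SUM(size(_offsets)) grouped by year over the index.
totals_from_index = Counter()
for shipdate, offsets in index_rows:
    totals_from_index[year(shipdate)] += len(offsets)

print(totals_from_base == totals_from_index)
```

The rewritten form touches one row per distinct key instead of one row per base-table row, which is where the I/O saving comes from.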

Histogram Query

Original query:
SELECT YEAR(l_shipdate) AS Year,
       MONTH(l_shipdate) AS Month,
       COUNT(1) AS Monthly_shipments
FROM lineitem
GROUP BY YEAR(l_shipdate), MONTH(l_shipdate);

Rewritten to use the index:
SELECT YEAR(l_shipdate) AS Year,
       MONTH(l_shipdate) AS Month,
       SUM(sz) AS Monthly_shipments
FROM (SELECT l_shipdate, SIZE(`_offsets`) AS sz
      FROM __lineitem_shipdate_idx__) AS t
GROUP BY YEAR(l_shipdate), MONTH(l_shipdate);

Year on Year Query


Original query:
SELECT y1.Month AS Month,
       y1.shipments AS Y1_shipments,
       y2.shipments AS Y2_shipments,
       (y2.shipments - y1.shipments) / y1.shipments AS Delta
FROM (SELECT YEAR(l_shipdate) AS Year,
             MONTH(l_shipdate) AS Month,
             COUNT(1) AS Shipments
      FROM lineitem
      WHERE YEAR(l_shipdate) = 1997
      GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y1
JOIN (SELECT YEAR(l_shipdate) AS Year,
             MONTH(l_shipdate) AS Month,
             COUNT(1) AS Shipments
      FROM lineitem
      WHERE YEAR(l_shipdate) = 1998
      GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y2
ON y1.Month = y2.Month;

Year on Year Query


Rewritten to use the index:
SELECT y1.Month AS Month,
       y1.shipments AS y1_shipments,
       y2.shipments AS y2_shipments,
       (y2.shipments - y1.shipments) / y1.shipments AS delta
FROM (SELECT YEAR(l_shipdate) AS Year,
             MONTH(l_shipdate) AS Month,
             SUM(sz) AS shipments
      FROM (SELECT l_shipdate, size(`_offsets`) AS sz
            FROM __lineitem_shipdate_idx__) AS t1
      WHERE YEAR(l_shipdate) = 1997
      GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y1
JOIN (SELECT YEAR(l_shipdate) AS Year,
             MONTH(l_shipdate) AS Month,
             SUM(sz) AS shipments
      FROM (SELECT l_shipdate, size(`_offsets`) AS sz
            FROM __lineitem_shipdate_idx__) AS t
      WHERE YEAR(l_shipdate) = 1998
      GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y2
ON y1.Month = y2.Month;

Performance tests
Hardware and software configuration:
- 2 server-class machines (each box: CentOS 5.x Linux, 5 SAS disks in RAID5, 16GB RAM)
- 2-node Hadoop cluster (0.20.2), untuned and unoptimized; data not partitioned or clustered; Hive tables stored in row-store format; HDFS replication factor: 2
- Hive development branch (~0.5)
- Sun JDK 1.6 (server-mode JVM, JVM_HEAP_SIZE: 4GB RAM)
- Queries on TPC-H data (lineitem table: 70% of TPC-H data size, e.g. TPC-H 30GB data: 21GB lineitem, ~180 million tuples)


Perf gain for Histogram Query

Graphs not to scale

(sec)       1M        1G        10G        30G
q1_noidx    24.161    76.79     506.005    1551.555
q1_idx      21.268    27.292    35.502     86.133

Perf gain for Year on Year Query

Graphs not to scale

(sec)       1M        1G         10G        30G
q1_noidx    73.66     130.587    764.619    2146.423
q1_idx      69.393    75.493     92.867     190.619
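
To put the timings in perspective, the speedup is the ratio of the non-indexed time to the indexed time. The following small Python sketch computes it from the timings reported above for both queries; the dictionary and function names are only for illustration.

```python
# Timings in seconds, copied from the two result slides: (no index, with index).
histogram = {"1M": (24.161, 21.268), "1G": (76.79, 27.292),
             "10G": (506.005, 35.502), "30G": (1551.555, 86.133)}
year_on_year = {"1M": (73.66, 69.393), "1G": (130.587, 75.493),
                "10G": (764.619, 92.867), "30G": (2146.423, 190.619)}

def speedups(timings):
    """Map each data size to noidx_time / idx_time, rounded to one decimal."""
    return {size: round(noidx / idx, 1) for size, (noidx, idx) in timings.items()}

for name, timings in (("histogram", histogram), ("year on year", year_on_year)):
    print(name, speedups(timings))
```

At 30G the ratio works out to roughly 18x for the histogram query and roughly 11x for the year-on-year query, while at 1M the two plans take nearly the same time.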

Why does the index perform better?

- Reducing data increases I/O efficiency
  - If you need only X, separate X from the rest
  - Less data to process, better memory footprint, better locality of reference
- Parallelization
  - Process the index data in the same manner as the base table; distribute the processing across nodes
  - Scalable!
- Exploiting storage layout optimization
  - Right tool for the job, e.g. there are two ways to do GROUP BY: sort + aggregate or hash + aggregate
  - The sort step is already done in the index!
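
The two GROUP BY strategies named in the last point can be sketched as follows. This is a minimal Python illustration, not Hive's operators: hash aggregation accepts input in any order but keeps one entry per key in memory, while sort-based aggregation assumes the input is already ordered by key and can stream with only the current group in hand. The function names are illustrative.

```python
from itertools import groupby

def hash_aggregate(keys):
    """Count rows per key for input in any order, using a hash table."""
    counts = {}
    for key in keys:
        counts[key] = counts.get(key, 0) + 1
    return counts

def sorted_aggregate(sorted_keys):
    """Count rows per key for input already sorted by key.

    Yields one (key, count) pair at a time; only the current group is held.
    """
    for key, group in groupby(sorted_keys):
        yield key, sum(1 for _ in group)

unsorted = ["1992-01-16", "1992-01-08", "1992-01-16"]
presorted = sorted(unsorted)  # stands in for data read from a key-ordered index

print(hash_aggregate(unsorted))
print(dict(sorted_aggregate(presorted)))
```

Both produce the same counts; the point is that when the data already arrives in key order, the sort step of the sort-based plan costs nothing.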


Near future

- More rewrites
- Partitioning index data per key
- Run-time operators for index usage (lookup, join, filter, etc., since rewrites are only a partial solution)
- Optimizer support for index operators
- Cost-based optimizer to choose between index and non-index plans


Index Design

[Architecture diagram. Components shown: Hive DDL Compiler, Hive DDL Engine, Index Builder; Hive Query Compiler, Hive Query Engine, Query Rewrite Engine; Hadoop MR; HDFS]

Hive Compiler

[Diagram of compiler stages, in the order shown: Parser / AST Generator, Semantic Analyzer, Query Rewrite Engine, Optimizer / Operator Plan Generator, Execution Plan Generator, then to Hadoop MR]

Query Rewrite Engine

[Diagram: the Rule Engine takes a Query Tree and produces a Rewritten Query Tree, drawing on a Rewrite Rules Repository that holds multiple Rewrite Rules]

Rewrite Rule
- Rewrite Trigger
- Rewrite Condition
- Rewrite Action
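
The trigger / condition / action structure of a rewrite rule can be sketched as follows. This is an illustrative Python model, not the Hive implementation: the query tree is reduced to a small dictionary, and all names (`RewriteRule`, `apply_rules`, `gb_to_idx`, `indexes`) are hypothetical. Treating the trigger as a cheap applicability check and the condition as a validity check is an assumed reading of the slide's labels.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Query = Dict  # hypothetical stand-in for a query tree

@dataclass
class RewriteRule:
    trigger: Callable[[Query], bool]    # assumed: could this rule apply at all?
    condition: Callable[[Query], bool]  # assumed: is the rewrite valid here?
    action: Callable[[Query], Query]    # produce the rewritten query

def apply_rules(query: Query, rules: List[RewriteRule]) -> Query:
    """Apply, in order, each rule whose trigger and condition both hold."""
    for rule in rules:
        if rule.trigger(query) and rule.condition(query):
            query = rule.action(query)
    return query

# Hypothetical catalog: (table, column) -> index table name.
indexes = {("lineitem", "l_shipdate"): "__lineitem_shipdate_idx__"}

# Hypothetical rule: COUNT(1) grouped by a single indexed key becomes a
# scan of the index table that returns the size of the offsets array.
gb_to_idx = RewriteRule(
    trigger=lambda q: q.get("agg") == "COUNT(1)" and len(q.get("group_by", [])) == 1,
    condition=lambda q: (q["table"], q["group_by"][0]) in indexes,
    action=lambda q: {
        "table": indexes[(q["table"], q["group_by"][0])],
        "select": q["group_by"] + ["size(`_offsets`)"],
    },
)

original = {"table": "lineitem", "group_by": ["l_shipdate"], "agg": "COUNT(1)"}
rewritten = apply_rules(original, [gb_to_idx])
print(rewritten)
```

A query on a table without a matching index passes through unchanged, because the condition fails and the action is never run.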


Learning Hive
- The Hive compiler is not driven by syntax-directed translation
  - It is tree-visitor based, with data structures separated from compiler logic
  - The tree is immutable (harder to change, harder to rewrite)
  - Query semantic information is maintained separately from the query lexical/parse tree, in different data structures. These are loosely bound in a Query Block data structure, which is itself loosely bound to the parse tree, yet there is no larger data flow graph off which everything is hung. This makes it very difficult to rewrite queries.
- The optimizer is not yet mature
  - It doesn't handle many obvious opportunities (e.g. sort-based GROUP BY for cases other than base table scans)
  - The optimizer is rule-based, not cost-based; no stats are collected
  - Query tuning is a harder job (it requires special knowledge of the optimizer's guts: what works and what doesn't)
- Setting up the development environment is tedious (the build system relies heavily on an internet connection, which is troublesome behind restrictive firewalls).
- Folks in the community are very active; dependent JIRAs are a fast-moving target, and development-wise we need to keep up with them actively (e.g. if branching, we need to refresh from trunk frequently).


How to get it?


- Needs a working Hadoop cluster (tested with 0.20.2)
- For Hive with indexing support:
  - The Hive Index DDL patch (JIRA 417) is now part of Hive trunk: https://issues.apache.org/jira/browse/HIVE-417
  - Get the Hive branch with the Index Query Rewrite patch applied from GitHub (a fork/branch of the Hive development tree, a snapshot of the Hive + Index DDL source tree; not the latest, but a single place to get everything): http://github.com/prafullat/hive
  - Refer to the Hive documentation for building: http://wiki.apache.org/hadoop/Hive/GettingStarted#Downloading_and_building
  - See the ql/src/test/queries/client/positive/ql_rewrite_gbtoidx.q test.


Thank You!
prafulla_tekawade at persistent dot co dot in
nikhil_deshpande at persistent dot co dot in
