Indexed Hive
A quick demonstration of Hive performance acceleration using indexes

By: Prafulla Tekawade, Nikhil Deshpande
www.persistentsys.com
Summary
This presentation describes a performance experiment that uses indexes in Hive to accelerate query execution. The slides cover:
- Indexes
- A specific set of GROUP BY queries
- The query rewrite technique
- The performance experiment and its results
2010 Persistent Systems Ltd
Hive usage
- HDFS spreads and scatters the data across different locations (data nodes).
- Data is dumped and loaded into HDFS as is.
- There is only one view of the data: the original data structure and layout.
- Data is typically append-only.
- Processing times are dominated by full data scan times.
Can the data access times be better?
Hive usage
What can be done to speed up queries?
- Cut down the data I/O. Less data means faster processing.

Different ways to get performance:
- Columnar storage
- Data partitioning
- Indexing (a different view of the same data)
Hive Indexing
- Provides a key-based view of the data
- Key data is duplicated
- Storage layout favors search and lookup performance
- Provides better data access for certain operations
- A cheaper alternative to full data scans!
- How cheap? An order of magnitude better in certain cases!
What does the index look like?

An index is a table with 3 columns:

hive> describe default__tpch1m_lineitem_tpch1m_lineitem_shipdate_idx__;
OK
l_shipdate     string           (key)
_bucketname    string           (reference to values)
_offsets       array<string>    (reference to values)
Data in the index looks like:

hive> select * from default__tpch1m_lineitem_tpch1m_lineitem_shipdate_idx__ limit 2;
OK
1992-01-08    hdfs://hadoop1:54310/user//lineitem.tbl    ["662368"]
1992-01-16    hdfs://hadoop1:54310/user//lineitem.tbl    ["143623","390763","637910"]
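The three-column layout above can be mimicked in a few lines of Python. This is only an in-memory sketch of the data shape, not how Hive's index builder works; the function name build_index and the input format of (offset, key) pairs are assumptions made for illustration, and the sample offsets are the ones shown in the output above.

```python
from collections import defaultdict

def build_index(rows, bucket_name):
    """Build a toy index from (byte_offset, key) pairs of a single file.

    Returns {key: (bucket_name, [offsets as strings])}, mirroring the three
    index columns: key, _bucketname, _offsets. Hypothetical helper.
    """
    offsets_by_key = defaultdict(list)
    for offset, key in rows:
        offsets_by_key[key].append(str(offset))
    # Emit keys in sorted order, as a lookup-friendly layout would.
    return {key: (bucket_name, offsets_by_key[key]) for key in sorted(offsets_by_key)}

# (offset, l_shipdate) pairs; the row order in the base file is arbitrary.
rows = [
    (143623, "1992-01-16"),
    (390763, "1992-01-16"),
    (637910, "1992-01-16"),
    (662368, "1992-01-08"),
]
bucket = "hdfs://hadoop1:54310/user//lineitem.tbl"
index = build_index(rows, bucket)
```

Each key appears once in the index, and the length of its offsets list equals the number of base-table rows with that key. The rewrites on the following slides rely on that property.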
Hive index in HQL
- SELECT (mapping, projection, association: given a key, fetch the value)
- WHERE (filters on keys)
- GROUP BY (grouping on keys)
- JOIN (join key as index key)

Indexes have high potential for accelerating a wide range of queries.
Hive Index
- Index as Reference
- Index as Data

This demonstration uses the Index as Data technique to show an order-of-magnitude performance gain!
- Uses a query rewrite technique to transform queries on the base table into queries on the index table.
- Limited applicability currently (e.g. the demo is based on GROUP BY), but the technique itself has wide potential.
- Also a very quick way to demonstrate the importance of indexes for performance (no deep optimizer/execution engine modifications).
Indexes and Query Rewrites
Demo targeting:
- GROUP BY, aggregation
- Index as Data
- Group By Key = Index Key

The query is rewritten to use the index, but it is still a valid query (nothing special in it!).
Query Rewrites: simple gb
Original:
SELECT DISTINCT l_shipdate FROM lineitem;
Rewritten:
SELECT l_shipdate FROM __lineitem_shipdate_idx__;
Query Rewrites: simple agg
Original:
SELECT l_shipdate, COUNT(1) FROM lineitem GROUP BY l_shipdate;
Rewritten:
SELECT l_shipdate, size(`_offsets`) FROM __lineitem_shipdate_idx__;
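Why the rewrite is valid can be checked on a toy example in Python. This is a sketch under one assumption: the index holds exactly one row per key, as in the single-file sample shown earlier. The column values and the helper names count_by_scan and count_by_index are made up for illustration.

```python
from collections import Counter

# Toy l_shipdate column of the base table (hypothetical values).
base_column = ["1992-01-16", "1992-01-08", "1992-01-16", "1992-01-16"]

# Matching toy index: key -> list of offsets, one offset per base row.
index = {
    "1992-01-08": ["662368"],
    "1992-01-16": ["143623", "390763", "637910"],
}

def count_by_scan(column):
    """SELECT k, COUNT(1) FROM base GROUP BY k: touches every base row."""
    return dict(Counter(column))

def count_by_index(idx):
    """SELECT k, size(_offsets) FROM idx: touches one row per distinct key."""
    return {key: len(offsets) for key, offsets in idx.items()}

scan_result = count_by_scan(base_column)
index_result = count_by_index(index)
```

Both functions return the same counts, but the index version reads one entry per distinct key instead of one per row.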
Query Rewrites: gb + where

Original:
SELECT l_shipdate, COUNT(1) FROM lineitem WHERE YEAR(l_shipdate) >= 1992 AND YEAR(l_shipdate) <= 1996 GROUP BY l_shipdate;
Rewritten:
SELECT l_shipdate, size(`_offsets`) FROM __lineitem_shipdate_idx__ WHERE YEAR(l_shipdate) >= 1992 AND YEAR(l_shipdate) <= 1996;
Query Rewrites: gb on func(key)

Original:
SELECT YEAR(l_shipdate) AS Year, COUNT(1) AS Total FROM lineitem GROUP BY YEAR(l_shipdate);
Rewritten:
SELECT Year, SUM(cnt) AS Total FROM (SELECT YEAR(l_shipdate) AS Year, size(`_offsets`) AS cnt FROM __lineitem_shipdate_idx__) AS t GROUP BY Year;
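When the grouping is on a function of the key, several index rows fall into the same group, so their sizes have to be summed. A minimal Python sketch of that two-step rewrite follows; the index contents and the name shipments_per_year are hypothetical.

```python
from collections import defaultdict

# Toy index: l_shipdate -> offsets (hypothetical contents).
index = {
    "1992-01-08": ["10"],
    "1992-01-16": ["20", "30", "40"],
    "1993-05-02": ["50", "60"],
}

def shipments_per_year(idx):
    """Inner query maps each index row to (YEAR(key), size(_offsets));
    outer query sums those sizes per year."""
    totals = defaultdict(int)
    for shipdate, offsets in idx.items():
        year = int(shipdate[:4])      # YEAR(l_shipdate)
        totals[year] += len(offsets)  # SUM(cnt)
    return dict(totals)

per_year = shipments_per_year(index)
```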
Histogram Query
Original:
SELECT YEAR(l_shipdate) AS Year,
       MONTH(l_shipdate) AS Month,
       COUNT(1) AS Monthly_shipments
FROM lineitem
GROUP BY YEAR(l_shipdate), MONTH(l_shipdate);

Rewritten:
SELECT YEAR(l_shipdate) AS Year,
       MONTH(l_shipdate) AS Month,
       SUM(sz) AS Monthly_shipments
FROM (SELECT l_shipdate, SIZE(`_offsets`) AS sz
      FROM __lineitem_shipdate_idx__) AS t
GROUP BY YEAR(l_shipdate), MONTH(l_shipdate);
Year on Year Query

Original:
SELECT y1.Month AS Month,
       y1.shipments AS Y1_shipments,
       y2.shipments AS Y2_shipments,
       (y2.shipments - y1.shipments) / y1.shipments AS Delta
FROM (SELECT YEAR(l_shipdate) AS Year,
             MONTH(l_shipdate) AS Month,
             COUNT(1) AS Shipments
      FROM lineitem
      WHERE YEAR(l_shipdate) = 1997
      GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y1
JOIN (SELECT YEAR(l_shipdate) AS Year,
             MONTH(l_shipdate) AS Month,
             COUNT(1) AS Shipments
      FROM lineitem
      WHERE YEAR(l_shipdate) = 1998
      GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y2
ON y1.Month = y2.Month;
Year on Year Query

Rewritten:
SELECT y1.Month AS Month,
       y1.shipments AS y1_shipments,
       y2.shipments AS y2_shipments,
       (y2.shipments - y1.shipments) / y1.shipments AS delta
FROM (SELECT YEAR(l_shipdate) AS Year,
             MONTH(l_shipdate) AS Month,
             SUM(sz) AS shipments
      FROM (SELECT l_shipdate, size(`_offsets`) AS sz
            FROM __lineitem_shipdate_idx__) AS t1
      WHERE YEAR(l_shipdate) = 1997
      GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y1
JOIN (SELECT YEAR(l_shipdate) AS Year,
             MONTH(l_shipdate) AS Month,
             SUM(sz) AS shipments
      FROM (SELECT l_shipdate, size(`_offsets`) AS sz
            FROM __lineitem_shipdate_idx__) AS t
      WHERE YEAR(l_shipdate) = 1998
      GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y2
ON y1.Month = y2.Month;
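The outer part of the query (join on month, then relative change) is the same whether the monthly totals come from the base table or from the index. A small Python sketch of that step, with made-up monthly totals and a hypothetical function name year_on_year:

```python
def year_on_year(y1, y2):
    """y1, y2: {month: shipments} for two years.

    Inner join on month, then delta = (y2 - y1) / y1, as in the query.
    Returns a list of (month, y1_shipments, y2_shipments, delta).
    """
    result = []
    for month in sorted(y1.keys() & y2.keys()):
        delta = (y2[month] - y1[month]) / y1[month]
        result.append((month, y1[month], y2[month], delta))
    return result

# Hypothetical monthly totals, not figures from the experiment.
totals_1997 = {1: 100, 2: 200}
totals_1998 = {1: 110, 3: 5}
report = year_on_year(totals_1997, totals_1998)
```

Months present in only one of the two years drop out, which matches the inner join in the query.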
Performance tests
Hardware and software configuration:
- 2 server-class machines (each box: CentOS 5.x Linux, 5 SAS disks in RAID5, 16GB RAM)
- 2-node Hadoop cluster (0.20.2), un-tuned and un-optimized; data not partitioned or clustered; Hive tables stored in row-store format; HDFS replication factor: 2
- Hive development branch (~0.5)
- Sun JDK 1.6 (server mode JVM, JVM_HEAP_SIZE: 4GB RAM)
- Queries on TPC-H data (lineitem table: 70% of TPC-H data size, e.g. TPC-H 30GB data: 21GB lineitem, ~180 million tuples)
Perf gain for Histogram Query
Graphs not to scale

Data size    q1_noidx (sec)    q1_idx (sec)
1M           24.161            21.268
1G           76.79             27.292
10G          506.005           35.502
30G          1551.555          86.133
Perf gain for Year on Year Query
Graphs not to scale

Data size    q1_noidx (sec)    q1_idx (sec)
1M           73.66             69.393
1G           130.587           75.493
10G          764.619           92.867
30G          2146.423          190.619
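The gain is easier to read as a ratio. The Python sketch below divides the no-index time by the index time for each data size, using only the timings reported above; the dictionary layout and the name speedups are assumptions for illustration.

```python
# Timings in seconds from the two result tables: (no index, with index).
timings = {
    "histogram": {
        "1M": (24.161, 21.268),
        "1G": (76.79, 27.292),
        "10G": (506.005, 35.502),
        "30G": (1551.555, 86.133),
    },
    "year_on_year": {
        "1M": (73.66, 69.393),
        "1G": (130.587, 75.493),
        "10G": (764.619, 92.867),
        "30G": (2146.423, 190.619),
    },
}

def speedups(results):
    """Return {data size: no-index time / index time}."""
    return {size: noidx / idx for size, (noidx, idx) in results.items()}

for query, results in timings.items():
    for size, factor in speedups(results).items():
        print(f"{query:13s} {size:>4s}  {factor:5.1f}x")
```

At 1M the two plans take nearly the same time, while at 10G and 30G the index plan is more than ten times faster for both queries.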
Why does the index perform better?

- Reducing data increases I/O efficiency
  - If you need only X, separate X from the rest
  - Less data to process, better memory footprint, better locality of reference
- Parallelization
  - Process the index data in the same manner as the base table; distribute the processing across nodes
  - Scalable!
- Exploiting storage layout optimization
  - Right tool for the job, e.g. two ways to do GROUP BY: sort + agg, or hash + agg
  - The sort step is already done in the index!
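The two GROUP BY strategies mentioned above can be contrasted in a short Python sketch. This only illustrates the general techniques and is not Hive's operator code; the function names are hypothetical.

```python
from collections import Counter
from itertools import groupby

def hash_aggregate(keys):
    """Hash + agg: works on any input order, keeps one counter per distinct key."""
    return dict(Counter(keys))

def sorted_stream_aggregate(sorted_keys):
    """Sort + agg, with the sort already done: equal keys are adjacent,
    so each group can be counted in one streaming pass."""
    return {key: sum(1 for _ in group) for key, group in groupby(sorted_keys)}

keys = ["1992-01-16", "1992-01-08", "1992-01-16", "1992-01-16"]
by_hash = hash_aggregate(keys)
# A sorted layout provides the keys in order; sorted() stands in for it here.
by_stream = sorted_stream_aggregate(sorted(keys))
```

The streaming version is only correct on sorted input, which is why having the sort step already done matters.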
Near future
- More rewrites
- Partitioning index data per key
- Run-time operators for index usage (lookup, join, filter, etc.), since rewrites are only a partial solution
- Optimizer support for index operators
- Cost-based optimizer to choose between index and non-index plans
Index Design
(Architecture diagram; components as labeled on the slide)
- Hive DDL Compiler, Hive DDL Engine
- Index Builder
- Hive Query Compiler, Hive Query Engine
- Query Rewrite Engine
- Hadoop MR
- HDFS
Hive Compiler
(Pipeline diagram; stages in the order listed on the slide)
- Parser / AST Generator
- Semantic Analyzer
- Query Rewrite Engine
- Optimizer / Operator Plan Generator
- Execution Plan Generator
- To Hadoop MR
Query Rewrite Engine
(Diagram; elements as labeled on the slide)
- Rule Engine: Query Tree in, Rewritten Query Tree out
- Rewrite Rules Repository: a collection of Rewrite Rules
- Rewrite Rule: Rewrite Trigger, Condition, Rewrite Action
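One way to picture a rule made of a trigger, a condition and an action is the Python sketch below. It is a toy model and not the patch's actual classes: the Query and RewriteRule types, the INDEXES catalog and the rule name gb_to_idx are all assumptions made for illustration, and a query is reduced to a table name, a select list and a group-by list.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Query:
    """Toy stand-in for a query tree (hypothetical)."""
    table: str
    select: tuple
    group_by: tuple = ()

@dataclass(frozen=True)
class RewriteRule:
    name: str
    trigger: Callable[[Query], bool]    # cheap check: is this rule relevant at all?
    condition: Callable[[Query], bool]  # full check: is the rewrite valid here?
    action: Callable[[Query], Query]    # builds the rewritten query

# Hypothetical catalog: (base table, indexed column) -> index table.
INDEXES = {("lineitem", "l_shipdate"): "__lineitem_shipdate_idx__"}

def _gb_trigger(q):
    return bool(q.group_by)

def _gb_condition(q):
    # Group-by key must be the index key, and only the key and COUNT(1) are selected.
    return (len(q.group_by) == 1
            and (q.table, q.group_by[0]) in INDEXES
            and set(q.select) <= {q.group_by[0], "COUNT(1)"})

def _gb_action(q):
    select = tuple("size(`_offsets`)" if col == "COUNT(1)" else col for col in q.select)
    return Query(INDEXES[(q.table, q.group_by[0])], select)

RULES = [RewriteRule("gb_to_idx", _gb_trigger, _gb_condition, _gb_action)]

def rewrite(query, rules=RULES):
    """Apply the first rule whose trigger and condition both hold."""
    for rule in rules:
        if rule.trigger(query) and rule.condition(query):
            return rule.action(query)
    return query  # no rule applies: leave the query unchanged

original = Query("lineitem", ("l_shipdate", "COUNT(1)"), ("l_shipdate",))
rewritten = rewrite(original)
```

Because Query is immutable, the action returns a new object instead of editing the input, which keeps each rule easy to test in isolation.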
Learning Hive
Hive compiler is not Syntax Directed Translation driven
- Tree-visitor based, with separation of data structures and compiler logic
- Tree is immutable (harder to change, harder to rewrite)
- Query semantic information is maintained separately from the query lexical/parse tree, in different data structures. These are loosely bound in a Query Block data structure, which is itself loosely bound to the parse tree, yet there is no bigger data flow graph off which everything is hung. This makes it very difficult to rewrite queries.

Optimizer is not yet mature
- Doesn't handle many obvious opportunities (e.g. sort group by for cases other than base table scans)
- Optimizer is rule-based, not cost-based; no stats are collected
- Query tuning is a harder job (requires special knowledge of the optimizer guts: what works and what doesn't)

Setting up the development environment is tedious (the build system relies heavily on an internet connection, which is troublesome behind restrictive firewalls).

Folks in the community are very active; dependent JIRAs are a fast-moving target, and development-wise we need to keep up with them actively (e.g. if branching, refresh frequently from trunk).
How to get it?

- Needs a working Hadoop cluster (tested with 0.20.2)
- For Hive with indexing support:
  - The Hive Index DDL patch (JIRA 417) is now part of Hive trunk: https://issues.apache.org/jira/browse/HIVE-417
  - Get the Hive branch with the Index Query Rewrite patch applied from GitHub (a fork/branch of the Hive development tree, a snapshot of the Hive + Index DDL source tree; not the latest, but a single place to get everything): http://github.com/prafullat/hive
- Refer to the Hive documentation for building: http://wiki.apache.org/hadoop/Hive/GettingStarted#Downloading_and_building
- See the ql/src/test/queries/client/positive/ql_rewrite_gbtoidx.q test.
Thank You!
prafulla_tekawade at persistent dot co dot in
nikhil_deshpande at persistent dot co dot in
