
History of Hive

At Facebook, data grew from GBs to 1 TB per day, and today it is 500+ TB per day

Rapidly growing data made traditional warehousing expensive

Scaling up vertically is very expensive

Hadoop is an alternative for storing and processing large data

MapReduce is very low level and requires custom code

Facebook developed Hive as a solution

Hive became a Hadoop subproject

Apache continued the development of Hive

What is Hive?

Hive is a Data Warehouse solution built on Hadoop

Hive is SQL on Hadoop

Hadoop ecosystem component for querying, managing and storing structured data on Hadoop

A system for data summarization and analysis

Provides an SQL dialect called HiveQL to process data on a Hadoop cluster

Hive translates HiveQL queries into MapReduce jobs

What is Hive?

Hive is not a full database

Hive does not provide record-level insert, update or delete

Hive does not provide transactions

Hive queries have higher latencies, even for small data

Hive provides a flexible schema

Hive supports plugging in custom MapReduce programs

Why Hive?

MapReduce programming is suitable only for experienced Java programmers

Hive provides a familiar, SQL-like programming model

Eliminates the need for writing complex Java code

Hive queries and scripts are simple

Hive Architecture

Driver - manages the lifecycle of a HiveQL statement through compilation, optimization and execution

Metastore - the system catalog, which contains metadata about table schemas and other system schemas; stored in a separate database such as MySQL

Thrift Server - allows clients to access Hive from languages like C++, Java etc.; optional Thrift servers are HiveServer or Hive Thrift

Hive Components

Serializers/Deserializers (SerDe)
  Read and write data in and out of Hive tables
  Support for custom SerDes

Metastore
  Implements the metastore server

Query Processor
  Processing framework implementation for translating HiveQL into MapReduce jobs

Hive SerDe

When a record is read into Hive, the InputFormat is used to read that record of data (whether it is text or binary). The record is then passed to the Deserializer, which translates it into a Java object that Hive can manipulate.

When Hive writes data out, it uses a Serializer to convert the record to the format expected by the OutputFormat. In other words, the Serializer takes the Java object Hive has been working with and turns it into something that Hive can write out.

Hive SerDe

Hive's default SerDe is the LazySimpleSerDe. This SerDe is considered lazy because values are only read (i.e. deserialized) if they are accessed by the query being run.

Hive also has several built-in SerDes, including a regular expression SerDe and a Hive binary format SerDe.

You can also implement your own custom SerDe. Additionally, there is a variety of third-party SerDes available, including ones that handle data in JSON and other formats.
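A hedged sketch of the built-in regular expression SerDe (the table name and regex here are made up for illustration; org.apache.hadoop.hive.serde2.RegexSerDe is the class that ships with Hive, and it requires all columns to be STRING):

CREATE TABLE weblog (
  host STRING,
  request STRING,
  status STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*)')
STORED AS TEXTFILE;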

Hive Metastore

Hive operates in 3 metastore modes

Embedded (Default)

Local

Remote

Hive Metastore - Embedded

Mainly used for unit tests

Only one process is allowed to connect to the metastore at a time

Hive metadata is stored in an embedded Apache Derby database

Hive Metastore - Local

The Hive metastore keeps running in the same process as the Hive service

Moves the datastore into a separate process that the metastore communicates with, e.g. MySQL

The Hive client opens a connection to the datastore and makes queries against it

Hive Metastore - Remote

Hive metastore moves out of the process of Hive service

All Hive clients will make a connection to the metastore server and
server queries the datastore for metadata

Metastore and clients communicate via the Thrift protocol

Used in production environments
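A hedged sketch of pointing a client at a remote metastore (the hostname is an assumption; 9083 is the conventional metastore port; hive.metastore.uris is the standard property, usually set in hive-site.xml):

$HIVE_HOME/bin/hive --hiveconf hive.metastore.uris=thrift://metastore-host:9083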

Hive Workflow

Hive Execution Workflow

Hive Connection

Hive can be connected through:

Hive command line interface (CLI)

HiveServer2 (JDBC, ODBC, Beeline)

Beeswax (Hue)

Command Line Interface

Set the HIVE_HOME environment variable

export HIVE_HOME=<Hive install directory>

Run hive from the bin folder in the Hive home directory

$HIVE_HOME/bin/hive

Hive Server 2

HiveServer2 (introduced in Hive 0.11) has its own advantages.

The Hive CLI is deprecated in favor of Beeline, as the CLI lacks multi-user support, security and other capabilities

HiveServer2 can be started as:

$HIVE_HOME/bin/hiveserver2

Beeline

Beeline is started with the JDBC URL of HiveServer2, which depends on the port where HiveServer2 was started

Starting the Beeline client:

$HIVE_HOME/bin/beeline -u jdbc:hive2://localhost:10000

Beeswax

Beeswax is a user interface to the Hive data warehouse system

Beeswax provides a GUI to Hive

Hive Data Model

Hive organizes data using familiar database concepts such as tables, columns and rows

Data Unit

Databases - namespaces that separate tables and other data units

Tables - homogeneous collections of data having the same schema

Partitions - divisions of table data based on key values

Buckets - divisions of partitions based on the hash value of a column

Data Types

Supports primitive data types and complex collection types

Primitive: tinyint, smallint, int, bigint, boolean, string, float, double, binary, timestamp

Collection: arrays, maps, structs, union

Data Types

address STRUCT<city:STRING, state:STRING>

  struct('Chicago', 'IL')  ->  address.city = 'Chicago'

name MAP<STRING, STRING>: map('first', 'Adams', 'last', 'John')

  name['first'] -> 'Adams'

names ARRAY<STRING>: array('Hello', 'World')

  names[1] = 'World'
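A hedged sketch tying these fragments together (the person table is made up for illustration):

CREATE TABLE person (
  names ARRAY<STRING>,
  name MAP<STRING, STRING>,
  address STRUCT<city:STRING, state:STRING>);

SELECT address.city, name['first'], names[1] FROM person;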

File Format

File format specifies how records are encoded in files

The default file format is TEXTFILE

Hive uses control characters as default delimiters in text files:
  ^A (octal 001), ^B (octal 002), ^C (octal 003), \n

Supports text files such as CSV and TSV

File Format

TextFile
  Each record is a line in the file
  Suitable for sharing data with other tools
  Can be viewed and edited manually

SequenceFile
  Flat files that store binary key-value pairs
  Supports uncompressed, record-compressed and block-compressed formats

RCFile
  Stores the columns of a table in a record columnar way

ORC File
  Optimized Row Columnar file format
Hive Query Language

Simple, SQL-like query language

A subset of SQL

Supports custom MapReduce scripts

Support for user-defined functions

No support for row-level insert/update/delete

Hive Databases

Hive databases are like namespaces/catalogs

If no database is specified, the default database is used

Hive creates a directory for each database it creates

The default directory is specified by the property hive.metastore.warehouse.dir

A different directory can be specified using the LOCATION option in the CREATE command

Hive Database Commands

Creating Database
CREATE DATABASE [IF NOT EXISTS] mydb1;

Listing Databases
SHOW DATABASES;

Describing database
DESCRIBE DATABASE mydb1;
DESCRIBE DATABASE EXTENDED mydb1;

Using Database
USE mydb1;
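A hedged sketch of the LOCATION option mentioned earlier (the database name and HDFS path are made up):

CREATE DATABASE IF NOT EXISTS mydb2
LOCATION '/user/hive/custom/mydb2.db';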

Hive Database Commands

Dropping Database
DROP DATABASE [IF EXISTS] mydb1;
DROP DATABASE [IF EXISTS] mydb1 CASCADE;

Altering Database
ALTER DATABASE mydb1 SET DBPROPERTIES('edited-by' = 'CTS');

Hive Tables

Hive tables are equivalent to tables in a relational database

Each table has an HDFS directory

Data is stored in files within that directory

Hive supports 2 types of tables

Managed tables
External tables

Managed Table

The lifecycle of both data and metadata is controlled by Hive

Data is stored under the subdirectory defined by hive.metastore.warehouse.dir

When the table is dropped, both data and metadata are deleted

Not a good choice for sharing data with other tools/apps

External Table

The lifecycle of the data is not controlled by Hive

Data is stored under the directory defined by the LOCATION keyword in the CREATE TABLE command

When the table is dropped, only the metadata is deleted; the data is not

A better choice for sharing data with other tools/apps

Table Commands

CREATE [EXTERNAL] TABLE IF NOT EXISTS mydb.employees (
  name STRING,
  salary FLOAT,
  subordinates ARRAY<STRING>,
  deductions MAP<STRING, FLOAT>,
  address STRUCT<street:STRING, city:STRING, zip:INT>)
COMMENT 'Employee table'
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY ':'
  MAP KEYS TERMINATED BY '|'
LOCATION '/user/hive/warehouse/mydb.db/employees';

Table Commands

Describe Table

DESCRIBE employees;
DESCRIBE employees.name;

Alter Table
ALTER TABLE employees ADD COLUMNS (lastname STRING);

Drop Table
DROP TABLE employees;

Table Commands

Add Partition

ALTER TABLE log_data ADD PARTITION (year=2013)
LOCATION '/logs/2013';

Change Column name, type, position

ALTER TABLE employees CHANGE COLUMN salary emp_salary INT;
Hive Partitions

Partitioning a table is a way of dividing the table into multiple parts based on some entity value

Improves query performance

Stored as subdirectories in the table directory

Hive supports 2 types of partition schemes

Static partitions
Dynamic partitions

Hive Partitions

Static Partition

Partition values need to be specified while loading data

Dynamic Partition
Partition values are determined from the partition column values
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

Both static and dynamic partitions can be used together; a load sketch follows.
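A hedged sketch of both load styles (the sales and staging_sales tables and their columns are made up for illustration):

-- static: the partition value is named in the statement
LOAD DATA LOCAL INPATH '/home/localdata/sales_2013.txt'
INTO TABLE sales PARTITION (year=2013);

-- dynamic: Hive derives the partition value from the last SELECT column
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE sales PARTITION (year)
SELECT amount, region, year FROM staging_sales;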

Hive Bucketing

Bucketing decomposes data sets into more manageable parts

Data in each partition is divided into individual buckets based on the hash of a column in the table

Every bucket is stored as a separate file in the partition directory

Users can specify the number of buckets for their data set

The number of buckets doesn't vary with the data

Good for joins and sampling

For bucketing, hive.enforce.bucketing should be set to true

Hive Bucketing

CREATE TABLE student (
  name STRING,
  rollno INT,
  score INT,
  grade STRING,
  assessment_date TIMESTAMP)
CLUSTERED BY (score) INTO 4 BUCKETS
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t';
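Since bucketing helps sampling, a hedged usage sketch against this table:

-- read one of the four buckets, i.e. roughly a quarter of the rows
SELECT * FROM student TABLESAMPLE(BUCKET 1 OUT OF 4 ON score);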

Hive DML Data Load

Hive supports data load from the local file system and from HDFS

From the local file system:

LOAD DATA LOCAL INPATH '/home/localdata/emp.txt'
OVERWRITE INTO TABLE employee;

From HDFS:

LOAD DATA INPATH '/user/data/emp.txt' INTO TABLE employee;

Schema on Read
On load, Hive does not check that the files in the table directory conform to the table schema
Any mismatch results in an error only at query time

Hive DML Data Load

Hive also supports loading data from one table into another, either into an existing table or at table creation time

INSERT OVERWRITE TABLE employee_name
SELECT name FROM employee;

CREATE TABLE employee_name AS
SELECT name FROM employee;

Hive DML Select

HQL allows writing SQL-like queries to retrieve data using the SELECT statement

The number of rows returned can be controlled using the LIMIT clause

Provides support for regular expressions, GROUP BY, HAVING

The WHERE clause supports all operators (>, >=, LIKE, IS NULL etc.)

Examples:
SELECT * FROM employee;
SELECT * FROM employee LIMIT 5;
SELECT * FROM employee WHERE name LIKE 'M%';

Hive DML Select

GROUP BY
HAVING

SELECT count(name), sum(salary), emploc
FROM employee
GROUP BY emploc
HAVING count(name) < 4;

Also supports
ORDER BY
SORT BY
DISTRIBUTE BY

Hive DML Select

ORDER BY
  Performs a total ordering of the query result set using a single reducer
  Can result in long-running queries

SORT BY
  Arranges the data within each reducer in ASC or DESC order
  Each reducer's output is sorted

DISTRIBUTE BY
  Used in association with SORT BY
  Controls how map output is divided among reducers; a sketch follows
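A hedged sketch using the employee examples above: rows for each location land on the same reducer and are sorted there:

SELECT name, emploc, salary
FROM employee
DISTRIBUTE BY emploc
SORT BY emploc, salary DESC;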

Hive DML JOINS

Hive supports equi-joins; there is no support for non-equi joins

INNER JOIN
LEFT/RIGHT/FULL OUTER JOIN
LEFT SEMI JOIN
CROSS JOIN

A separate MapReduce job is used for each pair of joined tables

When joining 3 or more tables, if every join condition uses the same key, a single MapReduce job is used

In every MR join, the last table is streamed through the reducers while the remaining tables are buffered in memory; see the hint sketch below
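A hedged sketch of the STREAMTABLE hint, which overrides which table is streamed (the tables match the examples on the next slide):

SELECT /*+ STREAMTABLE(e) */ e.empname, a.score
FROM employee e JOIN assessment a ON (e.empno = a.empno);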

Hive DML JOINS

Examples

SELECT e.empname, a.score, c.certificationname
FROM employee e
JOIN assessment a ON (e.empno = a.empno)
JOIN certification c ON (a.certid = c.certid);

SELECT c.certname, a.attempt
FROM certification c
LEFT OUTER JOIN assessment a ON (c.certid = a.certid);

SELECT e.empname
FROM employee e
LEFT SEMI JOIN assessment a ON (a.empno = e.empno AND a.attempt = 1);

Hive DML JOINS

Join Types

MAP SIDE JOIN

BUCKETED MAP JOIN
  set hive.optimize.bucketmapjoin = true;

SORT MERGE BUCKET MAP JOIN
  set hive.optimize.bucketmapjoin = true;
  set hive.optimize.bucketmapjoin.sortedmerge = true;
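A hedged sketch of requesting a map-side join via the MAPJOIN hint (the dept table is made up; the hinted table is loaded into memory on the mappers):

SELECT /*+ MAPJOIN(d) */ e.empname, d.deptname
FROM employee e JOIN dept d ON (e.deptno = d.deptno);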

Hive DML VIEWS

Hive views are purely logical objects with no associated storage

Examples
Creating a view:

CREATE VIEW emp_view AS
SELECT name, salary FROM employee;

Selecting from a view:

SELECT * FROM emp_view;

Hive DML INDEX

Hive has limited support for indexes

Indexing comes at a cost of storage and processing time

An index can be built on top of a few columns to speed up operations

No concept of foreign/primary keys

Example:
CREATE INDEX emp_idx ON
TABLE employee (name)
AS 'COMPACT'
WITH DEFERRED REBUILD;
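With DEFERRED REBUILD the index starts empty; it must be built explicitly, as in this usage sketch:

ALTER INDEX emp_idx ON employee REBUILD;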

Hive Functions

Hive provides support for the function types below

UDF (User-defined functions)

UDAF (User-defined aggregate functions)

UDTF (User-defined table-generating functions)

Custom MapReduce

UDF

User Defined Functions

A UDF operates on a single row and produces a single row as output

Examples:
o Mathematical - round, floor, ceil
o String - ascii, concat, lower, upper
o Date - year, month

A custom UDF can be created as a subclass of org.apache.hadoop.hive.ql.exec.UDF; a registration sketch follows
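A hedged sketch of registering and calling a custom UDF (the jar path and class name are made up; ADD JAR and CREATE TEMPORARY FUNCTION are standard HiveQL):

ADD JAR /path/to/my-udfs.jar;
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.MyLower';
SELECT my_lower(name) FROM employee;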

UDAF

User Defined Aggregate Functions

A UDAF operates on multiple rows and produces a single row as output

Examples:
o count, max, min, sum

A custom UDAF can be created as a subclass of org.apache.hadoop.hive.ql.exec.UDAF

UDTF

User Defined Table Generating Functions

A UDTF operates on a single row and produces multiple rows as output

Example:
o explode takes an array as input and produces each array element as a separate row

A custom UDTF can be created as a subclass of org.apache.hadoop.hive.ql.udf.generic.GenericUDTF; a usage sketch follows
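A hedged usage sketch of the built-in explode() against the employees table defined earlier; LATERAL VIEW is needed to select other columns alongside a UDTF:

SELECT explode(subordinates) AS subordinate FROM employees;

SELECT e.name, sub
FROM employees e
LATERAL VIEW explode(e.subordinates) subview AS sub;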

Custom Map Reduce

Hive provides support for writing custom mapper and reducer scripts to override the default map and reduce stages generated by Hive

Example

FROM (
  FROM pv_users
  MAP pv_users.userid, pv_users.date
  USING 'map_script'
  AS key, value
  CLUSTER BY key) map_output
INSERT OVERWRITE TABLE pv_users_reduced
REDUCE map_output.key, map_output.value
USING 'reduce_script'
AS date, count;

THANK YOU
