
Introduction to HBase

By: Venkat
Problems with traditional RDBMS
Stores relational data
Works well for a limited number of records
SQL JOINs become a bottleneck
Storage space for NULL values
Static Schema
Traditional RDBMS Table - Employee
Column Oriented Database - Dynamic Columns
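
The contrast between a static schema with NULL slots and dynamic columns can be sketched in a few lines. This is a toy illustration only, not HBase code; the employee fields and the helper names (`to_fixed_row`, `to_sparse_row`) are made up for the example.

```python
# Toy comparison: fixed-schema rows vs. sparse rows with dynamic columns.
# All field names and values below are hypothetical illustration data.

FIXED_COLUMNS = ["id", "name", "phone", "fax"]

def to_fixed_row(record):
    """RDBMS-style row: every column gets a slot; missing ones become None."""
    return tuple(record.get(col) for col in FIXED_COLUMNS)

def to_sparse_row(record):
    """Column-oriented style: only columns that hold a value are stored."""
    return {col: val for col, val in record.items() if val is not None}

employees = [
    {"id": 1, "name": "A", "phone": "555-0100"},
    {"id": 2, "name": "B", "fax": "555-0199", "twitter": "@b"},  # extra column
]

fixed = [to_fixed_row(e) for e in employees]
sparse = [to_sparse_row(e) for e in employees]

# Slots spent on NULLs in the fixed layout.
null_slots = sum(row.count(None) for row in fixed)
print("NULL slots in fixed layout:", null_slots)
# The sparse layout keeps the extra 'twitter' column without a schema change.
print("columns of second sparse row:", sorted(sparse[1]))
```

In the fixed layout the unexpected `twitter` column is simply dropped, while the sparse layout stores it for that one row.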
Problems with HIVE
No DML Operations
Query performance on very large Datasets
Static Schema
Versions
What is HBase?
Built on top of Hadoop
Distributed; uses HDFS for storage
Column Oriented Database
Multi-Dimensional(Versions)
Read / Write access to data on HDFS
Storage System
What HBase is NOT?
A SQL Database - No JOINs, No Query Engine,
No Datatypes, No SQL
No Schema
No DBA needed
HBase vs RDBMS
                                RDBMS                          HBase
Data Layout                     Row-oriented                   Column family oriented
Query Language                  SQL                            Get/put/scan/etc *
Security                        Authentication/Authorization   Unix Level of Security & Namespace ACLs
Max Data Size                   TBs                            Hundreds of PBs
Read / Write throughput limit   1000s queries/second           Millions of queries/second

HDFS vs HBase
HDFS is a distributed file system that is well suited
for the storage of large files. HBase, on the other
hand, is built on top of HDFS and provides fast
record lookups (and updates) for large tables.
HBase is based on Google's Bigtable concept
HBase building blocks
Table is a collection of rows, each of which has a Primary Key (Row Key)
Row is a collection of column families
Column family is a collection of columns
Column is a collection of key value pairs (Cells)
Each Cell Value contains a timestamp
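
The building blocks above nest naturally, which a small in-memory model can make concrete. This is a sketch under simplifying assumptions, not the HBase client API: the `put` and `get` helpers, the row key `emp1`, and the family and column names are all hypothetical.

```python
import time

# Nesting used here:
# table -> row key -> column family -> column -> {timestamp: value}
table = {}

def put(table, row_key, family, column, value, ts=None):
    """Store one cell version; the timestamp defaults to 'now' in milliseconds."""
    ts = int(time.time() * 1000) if ts is None else ts
    versions = (
        table.setdefault(row_key, {})
        .setdefault(family, {})
        .setdefault(column, {})
    )
    versions[ts] = value

def get(table, row_key, family, column):
    """Return the value with the highest timestamp for the given cell."""
    versions = table[row_key][family][column]
    return versions[max(versions)]

put(table, "emp1", "info", "name", "Alice", ts=1)
put(table, "emp1", "info", "name", "Alice B", ts=2)   # newer version, same cell
put(table, "emp1", "contact", "email", "alice@example.com", ts=1)

print(get(table, "emp1", "info", "name"))
print(sorted(table["emp1"]))  # column families present in this row
```

Writing the same cell twice does not overwrite the first value; both versions remain, keyed by timestamp.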
HBase Schema Samples :
HBase Architecture
HBase Components
ZooKeeper
Provides services like maintaining configuration information, naming,
providing distributed synchronization, etc.
ZooKeeper has znodes representing the different region servers.
Master servers use these nodes to discover available servers and to track server failures.
Clients use ZooKeeper to locate region servers
In pseudo-distributed and standalone modes, HBase manages its own internal ZooKeeper
HBase Components
HMaster
Responsible for schema changes and other metadata (.META) operations such as
creation of tables and column families
Assigns regions to the region servers and takes the help of ZooKeeper for this task
Handles load balancing of the regions across region servers
Unloads busy region servers by shifting regions to less occupied region servers
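
The idea of shifting regions from busy servers to less occupied ones can be sketched as a simple count-based loop. This is an assumption-laden toy and not HBase's actual balancer: the `balance` function, the server names, and the region names are hypothetical, and region count stands in for real load.

```python
def balance(assignments):
    """Move regions from the busiest server to the least busy one until
    region counts differ by at most one. Returns the list of moves made."""
    moves = []
    while True:
        busiest = max(assignments, key=lambda s: len(assignments[s]))
        idlest = min(assignments, key=lambda s: len(assignments[s]))
        if len(assignments[busiest]) - len(assignments[idlest]) <= 1:
            return moves
        region = assignments[busiest].pop()
        assignments[idlest].append(region)
        moves.append((region, busiest, idlest))

# Hypothetical cluster state: one overloaded server, two nearly idle ones.
assignments = {
    "rs1": ["r1", "r2", "r3", "r4", "r5"],
    "rs2": ["r6"],
    "rs3": [],
}
moves = balance(assignments)
for region, src, dst in moves:
    print(f"move {region}: {src} -> {dst}")
print({server: len(regions) for server, regions in assignments.items()})
```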
HBase Components
HRegionServer
The Region Servers have regions that -
Communicate with the client and handle data-related operations
Handle read and write requests for all the regions under it
Decide the size of the region by following the region size thresholds
Auto Sharding
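
Auto sharding can be pictured as a sorted list of region start keys, where a region that grows past a threshold is split in two. The sketch below is a simplified model: the `RegionMap` class is hypothetical, and a row count stands in for the region size threshold that a real cluster would apply.

```python
import bisect

class RegionMap:
    """Toy model: each region covers a contiguous range of sorted row keys."""

    def __init__(self, max_rows_per_region):
        self.max_rows = max_rows_per_region
        self.start_keys = [""]   # sorted start key of each region
        self.regions = [[]]      # sorted row keys held by each region

    def locate(self, row_key):
        """Index of the region whose key range contains row_key."""
        return bisect.bisect_right(self.start_keys, row_key) - 1

    def put(self, row_key):
        idx = self.locate(row_key)
        rows = self.regions[idx]
        pos = bisect.bisect_left(rows, row_key)
        if pos < len(rows) and rows[pos] == row_key:
            return  # row already present
        rows.insert(pos, row_key)
        if len(rows) > self.max_rows:
            self._split(idx)

    def _split(self, idx):
        """Split an oversized region at its middle row key."""
        rows = self.regions[idx]
        mid = len(rows) // 2
        left, right = rows[:mid], rows[mid:]
        self.regions[idx] = left
        self.regions.insert(idx + 1, right)
        self.start_keys.insert(idx + 1, right[0])

rm = RegionMap(max_rows_per_region=4)
for i in range(10):
    rm.put(f"row{i:02d}")

print(rm.start_keys)
print(rm.locate("row05"))
```

Because the start keys stay sorted, locating the region for any row key is a binary search rather than a scan over all regions.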
Data Distribution
A Big Map Table
Row Key + Column Key + timestamp => Value
Row Key   Column Key    Timestamp       Value
1         Info:name     1273516197868   Sakis
1         Info:age      1273871824184   21
1         Info:gender   1273746281432   Male
2         Info:name     1273863723227   Themis
2         Info:name     1273973134238   Andreas
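
The mapping Row Key + Column Key + timestamp => Value can be written directly as a dictionary keyed by that triple. This is only an in-memory sketch using the rows from the table above; the helper names `get_versions` and `get_latest` are hypothetical.

```python
# (row key, column key, timestamp) -> value, using the rows from the table.
cells = {
    ("1", "Info:name", 1273516197868): "Sakis",
    ("1", "Info:age", 1273871824184): "21",
    ("1", "Info:gender", 1273746281432): "Male",
    ("2", "Info:name", 1273863723227): "Themis",
    ("2", "Info:name", 1273973134238): "Andreas",
}

def get_versions(cells, row_key, column_key):
    """Return (timestamp, value) pairs for one cell, newest first."""
    hits = [
        (ts, value)
        for (row, column, ts), value in cells.items()
        if row == row_key and column == column_key
    ]
    return sorted(hits, reverse=True)

def get_latest(cells, row_key, column_key):
    """Return the newest value for one cell, or None if the cell is absent."""
    versions = get_versions(cells, row_key, column_key)
    return versions[0][1] if versions else None

print(get_latest(cells, "2", "Info:name"))
print(get_versions(cells, "2", "Info:name"))
```

Row 2 holds two versions of Info:name; a read without an explicit timestamp returns the newer one.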


HBase Namespace

HBase namespaces provide users with a project space in which they can create and
manage their own tables.

Table: All tables are members of a namespace; tables with no explicit namespace will
be members of the default namespace. A table can only be a member of a single
namespace, and once defined that membership is permanent.
RSG: A namespace may optionally have a default region server group. All the tables
created in the namespace will be members of the namespace's region server group.
A namespace can reference only one region server group, after which that group can
no longer be referenced by other namespaces. This can only be set during namespace creation.
Permissions: A namespace can have ACLs defined. Write access granted
to a namespace will permit table creation in that namespace.
This gives tenants their own domain of administration within the HBase cluster.

Quota: Quotas provide some level of control required to ensure that shared resources
are allocated fairly. As a first step we only intend to limit the number of tables and
regions a given namespace may contain.
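
The default-namespace rule and a table-count quota can be modeled in a few lines. This is a toy sketch, not HBase's implementation: the `Namespaces` class, the `QuotaExceeded` exception, and the `sales` namespace are hypothetical, and only the table limit (not the region limit) is modeled. The `namespace:table` naming follows the qualified table name form that HBase uses.

```python
class QuotaExceeded(Exception):
    """Raised when a namespace already holds its maximum number of tables."""

DEFAULT_NAMESPACE = "default"

def qualify(table_name):
    """Split 'namespace:table'; names without a namespace go to 'default'."""
    ns, sep, name = table_name.partition(":")
    return (ns, name) if sep else (DEFAULT_NAMESPACE, table_name)

class Namespaces:
    def __init__(self):
        self.tables = {DEFAULT_NAMESPACE: set()}
        self.max_tables = {}  # namespace -> table limit (absent = unlimited)

    def create_namespace(self, ns, max_tables=None):
        self.tables[ns] = set()
        if max_tables is not None:
            self.max_tables[ns] = max_tables

    def create_table(self, table_name):
        ns, name = qualify(table_name)
        limit = self.max_tables.get(ns)
        if limit is not None and len(self.tables[ns]) >= limit:
            raise QuotaExceeded(f"namespace {ns!r} is limited to {limit} tables")
        self.tables[ns].add(name)
        return ns, name

nss = Namespaces()
nss.create_namespace("sales", max_tables=1)
print(nss.create_table("sales:orders"))
print(nss.create_table("events"))        # no namespace given
try:
    nss.create_table("sales:refunds")    # second table exceeds the quota
except QuotaExceeded as exc:
    print("rejected:", exc)
```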

CLI Commands :
Ways to access HBase Data
HBase Shell
Thrift Server
REST Clients
Hadoop ecosystem clients (Hive, Pig, HCatalog, etc.)
Hive vs HBase
Hive                                         HBase
Hive is an SQL-like engine that runs         HBase is a NoSQL key/value database
MapReduce jobs on Hadoop
Hive can be used for analytical querying     HBase can be used for real-time querying
of data collected over a period of time
Provides data summarization and ad-hoc       Supports data storage for large tables
querying
Data can even be read and written from       Not possible in HBase
Hive to HBase and back again
Where to use HBase (Use Cases)?
To have random, real-time read/write access to Big Data
Fast random access to available data
Variable schema where each row is slightly different
Loading, searching, querying data by Row Key
Retrieve small set of data from billions of records
Where not to use HBase?
If you plan to scan the entire HBase table or the majority of it

If you are not using a filter against the rowkey column in your query
Use of "LIKE" against the rowkey column does not perform well
When creating external tables in Hive against HBase tables,
map the HBase rowkey against a string column in Hive.
If this is not done, rowkey is not used in the query and entire
table is scanned.
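
Why row key filters matter can be shown with a sorted list of keys: a prefix lookup touches only a contiguous range, while a substring match in the style of "LIKE" has to examine every row. This is a simplified in-memory sketch, not HBase or Hive code; the key format and the function names are hypothetical.

```python
import bisect

# Row keys are kept in sorted order, as in an HBase table.
row_keys = sorted(f"user{i:04d}" for i in range(1000))

def prefix_scan(sorted_keys, prefix):
    """Binary-search the contiguous range of keys starting with prefix.
    Returns (matching keys, number of rows touched)."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[lo:hi], hi - lo

def contains_scan(sorted_keys, fragment):
    """Substring match: sort order does not help, so every row is examined.
    Returns (matching keys, number of rows touched)."""
    hits = [key for key in sorted_keys if fragment in key]
    return hits, len(sorted_keys)

prefix_hits, prefix_touched = prefix_scan(row_keys, "user012")
like_hits, like_touched = contains_scan(row_keys, "012")

print("prefix scan touched", prefix_touched, "rows")
print("substring scan touched", like_touched, "rows")
```

The prefix scan reads only the rows it returns, whereas the substring scan reads the whole table regardless of how few rows match.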
