© Hortonworks Inc. 2011 – 2016. All Rights Reserved


The Enterprise and Connected Data,
Trends in the Apache Hadoop
Ecosystem
Alan Gates
Co-Founder
Hortonworks
@alanfgates
Our Hadoop Journey Begins…
2006

Batch apps

MapReduce

HDFS
Nodes 1 … N

Our Hadoop Journey: Ecosystem Innovation Accelerates
2006 2011 Today

6 Years of Apache Hive and Beyond
A SQL data warehouse infrastructure that delivers fast, scalable SQL processing on Hadoop and in the Cloud
• Extensive SQL:2011 Support
• Compatible with every major BI Tool
• Proven at 300+ PB Scale

Timeline, 2010 to 2016:
• Apache Hive becomes a Top-Level Project
• HiveServer2 adds ODBC/JDBC
• SQL breadth expands with windowing and more
• Apache Tez enters incubation
• Hive 0.13 marks delivery of the Stinger Initiative with Tez, Vectorized Query and ORCFile support
• Standard SQL authorization, integration with Apache Ranger
• ACID transactions introduced
• Governance added with Apache Atlas integration
• Hive 2 introduces LLAP and intelligent in-memory caching

Hive 2 with LLAP: Architecture Overview

[Diagram: ODBC / JDBC SQL queries arrive at HiveServer2 (the query endpoint). Inside the YARN cluster, query coordinators dispatch work to a set of LLAP daemons, each containing query executors and an in-memory cache. Deep storage sits underneath: HDFS, plus S3 and other HDFS-compatible filesystems.]

Hive 2 with LLAP: 25+x Performance Boost
Hive 2 with LLAP averages 26x faster than Hive 1
[Chart: per-query time in seconds (lower is better) for Hive 1 / Tez and for Hive 2 / LLAP, with the speedup (x factor) plotted on a secondary axis.]
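
The speedup factor on this slide is presumably the Hive 1 / Tez time divided by the Hive 2 / LLAP time for each query; the slide does not say how the 26x average was computed. The sketch below assumes a plain arithmetic mean of per-query ratios. The query names and timings are hypothetical and are not the benchmark data behind the chart.

```python
from statistics import mean


def speedups(baseline_s, candidate_s):
    """Per-query speedup factor: baseline time divided by candidate time."""
    return {query: baseline_s[query] / candidate_s[query] for query in baseline_s}


# Hypothetical timings in seconds, for illustration only.
hive1_tez = {"q1": 120.0, "q2": 45.0, "q3": 200.0}
hive2_llap = {"q1": 4.0, "q2": 3.0, "q3": 10.0}

per_query = speedups(hive1_tez, hive2_llap)
# Assumption: "average" means the arithmetic mean of the per-query factors.
average_speedup = mean(per_query.values())
```

A geometric mean would be an equally reasonable choice when a few queries dominate; which one the slide used is not stated.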

What’s new in Spark 2.0?
• API Improvements
  – SparkSession – new entry point
  – Unified DataFrame & Dataset API
  – Structured Streaming / Continuous Applications
• Performance Improvements
  – Tungsten Phase 2 – whole-stage code generation
• ML
  – ML model persistence
  – Distributed R algorithms (GLM, Naïve Bayes, K-Means, Survival Regression)
• SparkSQL
  – SQL 2003 support (new ANSI SQL parser, subquery support)

How to Secure and Govern Access to Your Data?

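[Diagram: on one side, the entities in the data lake (HDFS files, HBase tables, Hive tables, streams, pipelines, feeds); on the other, the kinds of policies to enforce (classification, prohibition, time, location); a question mark sits between the two.]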

Secure and Govern Your Data with Tag-Based Access Policies

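[Diagram: Apache Atlas tracks metadata and lineage for the entities in the data lake (HDFS files, HBase tables, Hive tables, streams, pipelines, feeds) and keeps tags, assets and entities in its metastore. An Atlas client inside Apache Ranger subscribes to a topic and receives metadata updates. The Ranger policy decision point (PDP), backed by a resource cache, applies classification, prohibition, time and location policies. Ranger manages access policies and audit logs.]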
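
The idea of tag-based access control can be sketched in a few lines: policies are written against tags rather than individual tables or files, and a decision function checks group, time and location conditions. This is a minimal illustration only, not the Apache Ranger or Apache Atlas API; `TagPolicy`, `permits`, `is_allowed` and all sample values are hypothetical names introduced here.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class TagPolicy:
    """Hypothetical policy attached to a classification tag, not to a resource."""
    tag: str
    allowed_groups: FrozenSet[str]
    expires: Optional[date] = None                       # time-based condition
    allowed_locations: Optional[FrozenSet[str]] = None   # location-based condition


def permits(policy, group, today, location):
    """True if this single policy allows the request."""
    if group not in policy.allowed_groups:
        return False
    if policy.expires is not None and today > policy.expires:
        return False
    if policy.allowed_locations is not None and location not in policy.allowed_locations:
        return False
    return True


def is_allowed(entity_tags, policies, group, today, location):
    """Every tag on the entity must be covered by at least one permitting policy.

    Untagged entities pass through; in this sketch they are assumed to be
    handled by some other (resource-based) mechanism.
    """
    for tag in entity_tags:
        if not any(p.tag == tag and permits(p, group, today, location) for p in policies):
            return False
    return True


# Hypothetical catalog: tags would come from the metadata service.
entity_tags = {"sales.customers": {"PII"}}
policies = [
    TagPolicy("PII", frozenset({"compliance"}), date(2016, 12, 31), frozenset({"EU"})),
]
decision = is_allowed(entity_tags["sales.customers"], policies,
                      "compliance", date(2016, 6, 1), "EU")
```

Because the policy names the tag, tagging a new table as "PII" would bring it under the same rule without writing a new policy.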

Data In Motion
SOURCES → REGIONAL INFRASTRUCTURE → CORE INFRASTRUCTURE
• Constrained → Hybrid (cloud / on-premises)
• High-latency → Low-latency
• Localized context → Global context

Our Hadoop Journey: From the Data Center to the Cloud!
2006 Today

Why Hadoop in the Cloud?

• IT & Business Agility
• No Upfront HW Costs
• Ephemeral & Long-Running
• Unlimited Elastic Scale

Key Architectural Considerations for Hadoop in the Cloud

• On-Demand Ephemeral Workloads
• Elastic Resource Management
• Shared Metadata, Security & Governance
• Shared Data & Storage

Shared Data and Storage

Understand and Leverage Unique Cloud Properties
• Shared data lake is cloud storage accessible by all apps
• Cloud storage segregated from compute
• Built-in geo-distribution and DR

Focus Areas
• Address cloud storage consistency and performance
• Enhance performance via memory and local storage

Enhance Performance via Caching

Tabular Data: LLAP Read + Write-thru Cache
• Shared across jobs / apps and across engines
• Cache only the needed columns
• Spills to SSD when memory is full (anti-caching)
• Read & Write-through cache
• Security: Column-level and row-level

HDFS Caching for Non-tabular Data
• Cache data from cloud storage as needed
• Write-through cache

[Diagram: workloads read and write through a cache layer (HDFS files, LLAP R/W tables) that sits on top of cloud storage.]
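
The caching behaviour listed on this slide (cache only the needed columns, write through to storage, spill to SSD when memory is full) can be illustrated with a toy two-tier cache. This is a sketch under stated assumptions, not how LLAP is implemented: `ColumnCache`, the `memory_slots` parameter, and the plain dictionaries standing in for SSD and cloud storage are all hypothetical.

```python
from collections import OrderedDict


class ColumnCache:
    """Toy read + write-through cache keyed by (table, column)."""

    def __init__(self, store, memory_slots):
        self.store = store             # dict standing in for cloud storage
        self.memory = OrderedDict()    # in-memory tier, least recently used first
        self.ssd = {}                  # dict standing in for a local SSD tier
        self.memory_slots = memory_slots

    def _put(self, key, values):
        """Place a column in memory, spilling the coldest columns to SSD."""
        self.ssd.pop(key, None)
        self.memory[key] = values
        self.memory.move_to_end(key)
        while len(self.memory) > self.memory_slots:
            cold_key, cold_values = self.memory.popitem(last=False)
            self.ssd[cold_key] = cold_values   # spill instead of discarding

    def read(self, table, column):
        key = (table, column)
        if key in self.memory:
            self.memory.move_to_end(key)       # refresh recency
            return self.memory[key]
        # Prefer the SSD tier; fall back to storage for just this column.
        values = self.ssd[key] if key in self.ssd else self.store[key]
        self._put(key, values)
        return values

    def write(self, table, column, values):
        key = (table, column)
        self.store[key] = values               # write-through: storage first
        self._put(key, values)                 # then keep the cache consistent


# Hypothetical data: three columns of one table, room for two in memory.
storage = {
    ("sales", "amount"): [1, 2],
    ("sales", "region"): ["EU", "US"],
    ("sales", "id"): [10, 11],
}
cache = ColumnCache(storage, memory_slots=2)
cache.read("sales", "amount")
cache.read("sales", "region")
cache.read("sales", "id")        # pushes "amount" out of memory to the SSD tier
```

Because writes always go to the backing store first, losing the cache never loses data, which suits ephemeral clusters whose cache disappears with the cluster.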

Prescriptive On-Demand Ephemeral Workloads
[Diagram: four on-demand ephemeral workloads (Data Science, Warehouse, ETL, Search), each with its own compute fabric and R/W tables.]

Shared Data Requires Shared Metadata, Security, and Governance

Shared Metadata Across All Workloads
• Metadata considerations
  – Tabular data metastore
  – Lineage and provenance metadata
  – Pipeline and job management metadata
  – Add upon ingest
  – Update as processing modifies data
• Access / tag-based policies and audit logs
• Centrally stored to facilitate use across clusters
  – Ex. backed by Cloud RDS (or shared DB)

[Diagram: shared metadata spans files, objects, tables, streams, pipelines and feeds, together with classification, prohibition, time and location policies.]
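
To make "add upon ingest" and "update as processing modifies data" concrete, here is a minimal in-memory catalog sketch that records schema, tags and lineage when data arrives and again when a job derives a new dataset. It is illustrative only: `DatasetMetadata`, `MetadataCatalog` and the rule that derived datasets inherit their inputs' tags are assumptions made here, not behaviour taken from the slides. A shared deployment would keep these records in a central database rather than a Python dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    name: str
    schema: dict
    tags: set = field(default_factory=set)
    lineage: list = field(default_factory=list)   # direct upstream names
    history: list = field(default_factory=list)   # jobs that produced or changed it


class MetadataCatalog:
    """Hypothetical catalog shared by every workload and cluster."""

    def __init__(self):
        self._entries = {}

    def register_on_ingest(self, name, schema, source, tags=()):
        """Create the metadata record at the moment data lands."""
        entry = DatasetMetadata(name, dict(schema), set(tags),
                                [source], [f"ingest from {source}"])
        self._entries[name] = entry
        return entry

    def record_derivation(self, job, inputs, output, schema):
        """Record a dataset produced by a job; tags are inherited (assumption)."""
        inherited = set().union(*(self._entries[i].tags for i in inputs))
        entry = DatasetMetadata(output, dict(schema), inherited, list(inputs), [job])
        self._entries[output] = entry
        return entry

    def upstream(self, name):
        """All transitive ancestors of a dataset, nearest first."""
        seen, stack = [], list(self._entries[name].lineage)
        while stack:
            parent = stack.pop()
            if parent not in seen:
                seen.append(parent)
                if parent in self._entries:
                    stack.extend(self._entries[parent].lineage)
        return seen


# Hypothetical pipeline: a feed is ingested, then cleaned by an ETL job.
catalog = MetadataCatalog()
catalog.register_on_ingest("raw_orders", {"id": "int", "email": "string"},
                           source="orders_feed", tags={"PII"})
clean = catalog.record_derivation("daily_etl", ["raw_orders"],
                                  "orders_clean", {"id": "int"})
```

Keeping lineage next to tags is what would let an access policy written for a source dataset follow the data into everything derived from it.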

Elastic Resource Management in Context of Workload

Workload Management vs. Cluster Management
• Understand resource needs of different workload types
• Add / remove resources to meet workload SLAs
• Manage compute power and high-performance data access (e.g., LLAP)
• Pricing-aware: instances (spot, reserved), data, bandwidth
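
One way to read "add / remove resources to meet workload SLAs" together with "pricing-aware" is as a two-step calculation: size the cluster from the workload and its deadline, then fill that size from the cheapest instance classes first. The sketch below assumes work scales linearly with node count and uses a simple cap on the share of spot instances; the function names, parameters and numbers are all hypothetical.

```python
import math


def nodes_needed(work_units, units_per_node_hour, sla_hours):
    """Smallest node count that finishes within the SLA, assuming linear scaling."""
    return max(1, math.ceil(work_units / (units_per_node_hour * sla_hours)))


def plan_instances(required, reserved_available, max_spot_share):
    """Use already-paid reserved nodes first, then spot up to a cap, then on-demand."""
    reserved = min(required, reserved_available)
    extra = required - reserved
    spot = math.floor(extra * max_spot_share)   # cap limits exposure to revocation
    on_demand = extra - spot
    return {"reserved": reserved, "spot": spot, "on_demand": on_demand}


# Hypothetical ETL workload: 1200 units of work, 10 units per node-hour, 4-hour SLA.
required = nodes_needed(work_units=1200, units_per_node_hour=10, sla_hours=4)
plan = plan_instances(required, reserved_available=12, max_spot_share=0.5)
```

The point of sizing per workload rather than per cluster is that the same calculation can be rerun, and the cluster resized, each time a workload or its deadline changes.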

Transformational Applications Require Connected Data
[Diagram: connected data across the cloud and the data center. In each environment, edge data arrives as data in motion and lands as data at rest. Edge analytics, stream analytics, machine learning and deep historical analysis run across both.]

Thank You