04 Alan Gates Slides 161207150943

© Hortonworks Inc. 2011 – 2016.
All Rights Reserved

The Enterprise and Connected Data,
Trends in the Apache Hadoop
Ecosystem
Alan Gates
Co-Founder
Hortonworks
@alanfgates
Our Hadoop Journey Begins…
2006
Batch apps
MapReduce
HDFS
[Diagram: cluster nodes 1 … N]

Our Hadoop Journey: Ecosystem Innovation Accelerates
2006 2011 Today

6 Years of Apache Hive and Beyond
A SQL data warehouse infrastructure that delivers fast, scalable SQL processing on Hadoop and in the Cloud
• Extensive SQL:2011 Support
• Compatible with every major BI Tool
• Proven at 300+ PB Scale

Timeline, 2010 to 2016:
• Apache Hive becomes a Top-Level Project
• HiveServer2 adds ODBC/JDBC
• SQL breadth expands with windowing and more
• Apache Tez enters incubation
• Hive 0.13 marks delivery of the Stinger Initiative with Tez, Vectorized Query and ORCFile support
• Standard SQL authorization, integration with Apache Ranger
• ACID transactions introduced
• Governance added with Apache Atlas integration
• Hive 2 introduces LLAP and intelligent in-memory caching

Hive 2 with LLAP: Architecture Overview
[Diagram: ODBC/JDBC SQL queries arrive at HiveServer2 (the query endpoint). Inside the YARN cluster, query coordinators dispatch work to LLAP daemons, each of which runs query executors and holds an in-memory cache. Deep storage sits underneath: HDFS, S3, and other HDFS-compatible filesystems.]

Hive 2 with LLAP: 25+x Performance Boost
Hive 2 with LLAP averages 26x faster than Hive 1
[Chart: query time in seconds (lower is better) for Hive 1 / Tez and Hive 2 / LLAP, with the speedup (x factor) on a second axis.]

What’s new in Spark 2.0?
• API Improvements
– SparkSession – new entry point
– Unified DataFrame & DataSet API
– Structured Streaming/Continuous Application
• Performance Improvements
– Tungsten Phase 2 – Whole-stage code generation
• ML
– ML model persistence
– Distributed R algorithms (GLM, Naïve Bayes, K-Means, Survival Regression)
• SparkSQL
– SQL 2003 support (new ANSI SQL parser, subquery support)

How to Secure and Govern Access to Your Data?
[Diagram: entities in the data lake (streams, pipelines, feeds, HDFS files, HBase tables, Hive tables) on one side; policies based on classification, prohibition, time, and location on the other; a question mark between them.]

Secure and Govern Your Data with Tag-Based Access Policies
[Diagram: Atlas tracks metadata and lineage for entities in the data lake (streams, pipelines, feeds, HDFS files, HBase tables, Hive tables) and keeps tags, assets, and entities in its metastore. An Atlas client subscribes to a topic and gets metadata updates, which feed the resource cache used by Ranger. Ranger acts as the policy decision point (PDP) and manages access policies and audit logs. Policies are based on classification, prohibition, time, and location.]
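
The tag-based flow on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the Atlas or Ranger API: the `TagPolicy` class, the `is_allowed` function, and every tag, group, and entity name are hypothetical. The point it shows is that policies attach to tags rather than to individual resources, so an entity that gains a tag is covered without a new policy. Deny-overrides-allow and default-deny are assumptions of this sketch.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TagPolicy:
    """Hypothetical policy keyed on a tag rather than on a resource."""
    tag: str
    allowed_groups: frozenset = field(default_factory=frozenset)
    denied_groups: frozenset = field(default_factory=frozenset)


def is_allowed(user_groups, entity_tags, policies):
    """Return True if a policy on one of the entity's tags allows the user
    and no policy on those tags denies the user (deny wins; default deny)."""
    groups = set(user_groups)
    allowed = False
    for policy in policies:
        if policy.tag not in entity_tags:
            continue  # this policy does not apply to the entity
        if groups & policy.denied_groups:
            return False  # an explicit deny overrides any allow
        if groups & policy.allowed_groups:
            allowed = True
    return allowed


# Hypothetical tag assignments, standing in for metadata kept in a catalog.
entity_tags = {
    "hive:sales.customers": {"PII"},
    "hdfs:/data/clickstream": {"PUBLIC"},
}

policies = [
    TagPolicy(tag="PII",
              allowed_groups=frozenset({"compliance"}),
              denied_groups=frozenset({"contractors"})),
    TagPolicy(tag="PUBLIC",
              allowed_groups=frozenset({"analysts", "compliance"})),
]

print(is_allowed({"analysts"}, entity_tags["hive:sales.customers"], policies))    # False
print(is_allowed({"compliance"}, entity_tags["hive:sales.customers"], policies))  # True
```

Tagging a new table as `PII` in `entity_tags` would bring it under the existing policy with no policy change, which is the property the slide highlights.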

Data In Motion
SOURCES | REGIONAL INFRASTRUCTURE | CORE INFRASTRUCTURE
Toward the sources:
• Constrained
• High-latency
• Localized context
Toward the core:
• Hybrid – cloud/on-premises
• Low-latency
• Global context

Our Hadoop Journey: From the Data Center to the Cloud!
2006 Today

Why Hadoop in the Cloud?
• IT & Business Agility
• No Upfront HW Costs
• Ephemeral & Long-Running
• Unlimited Elastic Scale

Key Architectural Considerations for Hadoop in the Cloud
• On-Demand Ephemeral Workloads
• Elastic Resource Management
• Shared Metadata, Security & Governance
• Shared Data & Storage

Shared Data and Storage
Understand and Leverage Unique Cloud Properties
• Shared data lake is cloud storage accessible by all apps
• Cloud storage segregated from compute
• Built-in geo-distribution and DR
Focus Areas
• Address cloud storage consistency and performance
• Enhance performance via memory and local storage

Enhance Performance via Caching
Tabular Data: LLAP Read + Write-thru Cache
• Shared across jobs / apps and across engines
• Cache only the needed columns
• Spills to SSD when memory is full (anti-caching)
• Read & Write-through cache
• Security: Column-level and row-level
HDFS Caching for Non-tabular Data
• Cache data from cloud storage as needed
• Write-through cache
[Diagram: workloads access a cache layer (HDFS files, LLAP R/W tables) that sits on top of cloud storage.]
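
The caching behavior listed on this slide (cache only the needed columns, spill to SSD when memory is full, write through to storage) can be sketched as follows. This is a toy model under stated assumptions, not LLAP's implementation: the `ColumnCache` class is hypothetical, plain dictionaries stand in for cloud storage and the SSD tier, and least-recently-used eviction is an assumed spill policy that the slide does not specify.

```python
from collections import OrderedDict


class ColumnCache:
    """Hypothetical read + write-through cache keyed by (table, column)."""

    def __init__(self, backing_store, memory_capacity):
        self.backing_store = backing_store      # stands in for cloud storage
        self.memory_capacity = memory_capacity  # max column chunks held in memory
        self.memory = OrderedDict()             # least recently used entry first
        self.ssd = {}                           # stands in for a local SSD tier

    def _put_in_memory(self, key, values):
        self.memory[key] = values
        self.memory.move_to_end(key)
        while len(self.memory) > self.memory_capacity:
            # Memory is full: spill the least recently used chunk to SSD.
            old_key, old_values = self.memory.popitem(last=False)
            self.ssd[old_key] = old_values

    def read(self, table, column):
        key = (table, column)
        if key in self.memory:
            self.memory.move_to_end(key)        # mark as recently used
            return self.memory[key]
        if key in self.ssd:
            values = self.ssd.pop(key)          # promote from SSD to memory
        else:
            values = self.backing_store[key]    # miss: fetch only this column
        self._put_in_memory(key, values)
        return values

    def write(self, table, column, values):
        key = (table, column)
        self.backing_store[key] = values        # write-through: storage first
        self.ssd.pop(key, None)                 # drop any stale spilled copy
        self._put_in_memory(key, values)


# Hypothetical table with three columns; only two fit in memory.
store = {
    ("sales", "amount"): [10, 20],
    ("sales", "region"): ["eu", "us"],
    ("sales", "notes"): ["a", "b"],
}
cache = ColumnCache(store, memory_capacity=2)
cache.read("sales", "amount")
cache.read("sales", "region")
cache.write("sales", "amount", [10, 20, 30])
cache.read("sales", "notes")                    # forces a spill of "region"
print(sorted(cache.ssd))                        # [('sales', 'region')]
```

Because every write lands in the backing store before the cache is updated, a second engine reading the same store sees the new data even if it does not share this cache.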

Prescriptive On-Demand Ephemeral Workloads
[Diagram: four on-demand ephemeral workloads (Data Science, Warehouse, ETL, Search), each with its own compute fabric and R/W tables.]

Shared Data Requires Shared Metadata, Security, and Governance
Shared Metadata Across All Workloads
• Metadata considerations
– Tabular data metastore
– Lineage and provenance metadata
– Pipeline and job management metadata
– Add upon ingest
– Update as processing modifies data
• Access / tag-based policies and audit logs
• Centrally stored to facilitate use across clusters
– Ex. backed by Cloud RDS (or shared DB)
[Diagram: shared metadata covering streams, pipelines, feeds, files, objects, and tables, with policies based on classification, prohibition, time, and location.]

Elastic Resource Management in Context of Workload
Workload Management vs. Cluster Management
• Understand resource needs of different workload types
• Add / remove resources to meet workload SLAs
• Manage compute power and high-performance data-access (ex., LLAP)
• Pricing-aware: instances (spot, reserved), data, bandwidth
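
As a rough illustration of sizing resources to a workload SLA while staying pricing-aware, the sketch below computes how many nodes a workload needs to finish before its deadline and then fills that demand from reserved capacity first and spot instances second. Everything here is an assumption made for illustration: the function names, the notion of "work units", the linear scaling model, and the prices are hypothetical and do not come from the talk or from any cloud provider's API.

```python
import math


def nodes_needed(pending_work_units, units_per_node_hour, hours_until_sla):
    """Smallest node count that finishes the pending work before the SLA
    deadline, assuming throughput scales linearly with node count."""
    if hours_until_sla <= 0:
        raise ValueError("SLA deadline has already passed")
    return math.ceil(pending_work_units / (units_per_node_hour * hours_until_sla))


def plan_capacity(required_nodes, reserved_nodes, spot_price_per_hour):
    """Use reserved nodes first (already paid for), then add spot nodes."""
    use_reserved = min(required_nodes, reserved_nodes)
    use_spot = required_nodes - use_reserved
    return {
        "reserved": use_reserved,
        "spot": use_spot,
        "spot_cost_per_hour": use_spot * spot_price_per_hour,
    }


# Hypothetical workload: 1200 work units, 50 units per node-hour, 4 hours left.
required = nodes_needed(1200, units_per_node_hour=50, hours_until_sla=4)
print(required)  # 6
print(plan_capacity(required, reserved_nodes=4, spot_price_per_hour=0.25))
# {'reserved': 4, 'spot': 2, 'spot_cost_per_hour': 0.5}
```

A real scheduler would also weigh data-transfer and bandwidth costs, which the slide lists alongside instance pricing; this sketch leaves them out.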

Transformational Applications Require Connected Data
[Diagram: connected data spanning the cloud and the data center. Edge data, data in motion, and data at rest appear on both sides and feed edge analytics, stream analytics, machine learning, and deep historical analysis.]

Thank You
