
Hadoop 101 & Why Cloudera?

- Wahyu Budiman

© Cloudera, Inc. All rights reserved. 1


What is Big Data



“ The name my kid gave a stuffed yellow elephant.
Short, relatively easy to spell and pronounce,
meaningless, and not used elsewhere:
those are my naming criteria.
Kids are good at generating such.
Googol is a kid's term. ”
– Doug Cutting [1]

[1] https://issues.apache.org/jira/browse/NUTCH-193



What is Apache Hadoop?
Hadoop is a platform for data storage and processing that is:
• Scalable
• Fault tolerant
• Open source

CORE HADOOP COMPONENTS
• Hadoop Distributed File System (HDFS): File Sharing & Data Protection Across Physical Servers
• MapReduce: Distributed Computing Across Physical Servers

Flexibility
• A single repository for storing, processing & analyzing any type of data
• Not bound by a single schema

Scalability
• Scale-out architecture divides workloads across multiple nodes
• Flexible file system eliminates ETL bottlenecks

Low Cost
• Can be deployed on commodity hardware
• Open source platform guards against vendor lock-in
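To make the MapReduce component concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps, using word counting as the example job. It is an illustration only and does not use the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and on a real cluster the framework runs the map and reduce steps in parallel across servers.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
```

Because each mapper call sees only one line and each reducer call sees only one key, the work can be divided across many servers without changing the logic.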


Hadoop Evolution



Customer success across all industries
• Financial Services
• Telecom
• Healthcare & Life Sciences
• Media & Technology
• Retail & CPG
• Public Sector


Industry Use Cases – Driving Business value across all industries
Real customer use cases

Financial Services
• Customer 360
• Fraud / Cyber
• Compliance
• Spend Analytics
• Operational Data Store
• Market Data
• Algo Trading
• Active Archive

Telco / Media
• Customer 360
• Churn prediction
• Network Optimization
• Data Monetization
• EDW Augmentation
• Media Streaming
• Active Archive
• Predictive maintenance

Manufacturing / Life Science
• Connected: Car, Plane, Equipment
• Agile Supply Chain
• Predictive Maintenance
• IoT Data enabled “Smart Services”
• Clinical trials
• Diagnostics
• SAP active archive

Retail / CPG / Transportation
• Customer Loyalty
• Ship to Store
• Agile Supply Chain
• Next Best Offer
• Connected Store
• Completed baskets
• IoT – Stores
• Active archiving
• Smart Vessel

Government / Health care
• Border Control
• Risk / Intelligence
• Tax Optimization
• Cybersecurity
• Fraud prevention
• Intelligence
• Patient care
• Citizen 360
• Patient Records


The “paradigm shift” to Hadoop / Data Centric platforms
LEGACY = Data to Compute
Process-centric businesses use:
• Structured data mainly
• Internal data only
• “Important” data only
• Siloed data sources

MODERN = Compute to Data
Information-centric businesses use all data:
• Multi-structured, internal & external data of all types


What is Hadoop?
• Hadoop is a distributed data storage and processing platform
• Stores massive amounts of data in a very resilient way
• Handles low-level distributed system details and enables your developers to focus on the business problems
• Tools built around Hadoop (the “Hadoop ecosystem”) can be configured/extended to handle many different tasks, including
  • Extract Transform Load (ETL)
  • BI environment
  • General data storage
  • Predictive analytics
  • Statistical analysis
  • Machine learning
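
One way to picture how data is stored "in a very resilient way" is through block splitting and replication. The sketch below is a simplified model under stated assumptions: a 128 MB block size and a replication factor of 3 (the usual HDFS defaults in Hadoop 2 and later, both configurable), and a toy round-robin placement that stands in for the rack-aware placement a real cluster uses. All function names are hypothetical.

```python
import math

BLOCK_SIZE_MB = 128      # assumption: default HDFS block size (configurable)
REPLICATION_FACTOR = 3   # assumption: default HDFS replication (configurable)


def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks a file of the given size is split into."""
    return math.ceil(file_size_mb / block_size_mb)


def raw_storage_mb(file_size_mb, replication=REPLICATION_FACTOR):
    """Raw disk consumed once every block is stored `replication` times."""
    return file_size_mb * replication


def place_replicas(num_blocks, nodes, replication=REPLICATION_FACTOR):
    """Toy round-robin placement: each block goes to `replication` distinct nodes.

    Assumes len(nodes) >= replication. Real placement is rack-aware.
    """
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


if __name__ == "__main__":
    size_mb = 1000
    print(block_count(size_mb), "blocks,", raw_storage_mb(size_mb), "MB raw")
    print(place_replicas(block_count(size_mb), ["n1", "n2", "n3", "n4"]))
```

With three copies of every block on different nodes, losing a single server leaves at least two copies of each block available.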



Why Hadoop vs RDBMS / EDW / Data Warehouse
The Old Way
Compute (RDBMS, EDW) and Data Storage (SAN, NAS), linked over the Network
Expensive, Special purpose, “Reliable” Servers
Expensive Licensed Software
• Hard to scale
• Network inevitably becomes a bottleneck
• Only handles structured/relational data
• Difficult to add new fields & data types
Expensive & Unattainable: $30,000+ per TB

The Hadoop Way
Compute (CPU), Memory, and Storage (Disk) together in each server
Industry Standard Servers
Hybrid Open Source Software
• Scales out indefinitely
• Network eliminated as a bottleneck
• Easy to ingest any type of data
• Agile “schema-on-read” data access
Affordable & Attainable: $300-$1,000 per TB
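
The per-terabyte figures above can be turned into a quick back-of-the-envelope comparison. The sketch below uses only the slide's numbers ($30,000+ per TB versus $300-$1,000 per TB); the function names and the 100 TB example are illustrative assumptions, and because the legacy figure is stated as a lower bound, the results are lower bounds too.

```python
LEGACY_COST_PER_TB = 30_000        # slide: "$30,000+ per TB" (lower bound)
HADOOP_COST_PER_TB = (300, 1_000)  # slide: "$300-$1,000 per TB"


def storage_cost(terabytes):
    """Cost in dollars of storing `terabytes` under each model."""
    low, high = HADOOP_COST_PER_TB
    return {
        "legacy_min": terabytes * LEGACY_COST_PER_TB,
        "hadoop_low": terabytes * low,
        "hadoop_high": terabytes * high,
    }


def min_cost_ratio():
    """Smallest legacy-to-Hadoop cost ratio implied by the slide's figures."""
    return LEGACY_COST_PER_TB / HADOOP_COST_PER_TB[1]


if __name__ == "__main__":
    print(storage_cost(100))   # hypothetical 100 TB data set
    print(min_cost_ratio())
```

Even at the top of the Hadoop range, the slide's figures imply a cost per terabyte at least 30 times lower than the legacy figure.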

Data Lake, Data Cleansing, BI



Canonical Hadoop Architecture



Active Archive



Offloading DW ELT workload to Hadoop



Core Hadoop Origins: Storage and Processing

MapReduce

Hadoop Distributed File System (HDFS)



Core Hadoop with Resource Management

MapReduce

Workload Management (YARN)

Hadoop Distributed File System (HDFS)



Ingesting Data

MapReduce

Workload Management (YARN)

Hadoop Distributed File System (HDFS)

Data sources: RDBMS, Streaming Event Data (sensors, logs, devices)


Ingesting RDBMS Data

MapReduce

Workload Management (YARN)

Hadoop Distributed File System (HDFS)

Ingest: Sqoop

Data sources: RDBMS, Streaming Event Data (sensors, logs, devices)
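
A rough picture of how an RDBMS table can be pulled in by parallel tasks: the range of a numeric key column is divided into contiguous chunks, and each task reads only its own chunk. The sketch below is a simplified model of that idea, not Sqoop's actual code; the default of four parallel tasks, the table and column names, and both function names are assumptions made for illustration.

```python
def split_ranges(min_id, max_id, num_mappers=4):
    """Divide the inclusive key range [min_id, max_id] into contiguous chunks,
    at most one per mapper, with sizes differing by no more than one."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges, start = [], min_id
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # fewer keys than mappers: skip the empty chunk
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def mapper_query(table, column, lo, hi):
    """Bounded query that a single parallel task would run (illustrative SQL)."""
    return f"SELECT * FROM {table} WHERE {column} >= {lo} AND {column} <= {hi}"


if __name__ == "__main__":
    for lo, hi in split_ranges(1, 10):
        print(mapper_query("orders", "id", lo, hi))  # hypothetical table/column
```

Evenly sized chunks only balance the load when key values are spread evenly; a skewed key column would leave some tasks with far more rows than others.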


Ingesting Streaming Data

MapReduce

Workload Management (YARN)

Hadoop Distributed File System (HDFS)

Ingest: Sqoop, Flume, Kafka, Spark Streaming

Data sources: RDBMS, Streaming Event Data (sensors, logs, devices)
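
For streaming event data, a common pattern in systems such as Kafka is to route each event to a partition by hashing its key, so that all events from one sensor or device stay in order on one partition. The sketch below illustrates that pattern only; it does not use a Kafka client, `zlib.crc32` is a stand-in hash chosen for this example (Kafka's default partitioner uses a different hash), and the function names and sample events are hypothetical.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Map a record key to a partition index.

    crc32 is a stand-in hash for this sketch, not the hash Kafka uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(events, num_partitions):
    """Append each (key, payload) event to the partition chosen for its key."""
    partitions = defaultdict(list)
    for key, payload in events:
        partitions[partition_for(key, num_partitions)].append((key, payload))
    return partitions


if __name__ == "__main__":
    events = [("sensor-1", "t=20"), ("sensor-2", "t=31"), ("sensor-1", "t=21")]
    for index, records in sorted(route(events, 3).items()):
        print(index, records)
```

Because the partition depends only on the key, per-key ordering is preserved while different keys can be consumed in parallel.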


Ingesting Data: File Formats and Security

MapReduce

Workload Management (YARN)

Hadoop Distributed File System (HDFS)


File formats and security: Avro, Parquet, Encryption on Disk

Ingest: Sqoop, Flume, Kafka, Spark Streaming

Data sources: RDBMS, Streaming Event Data (sensors, logs, devices)
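
The two file formats named here differ mainly in layout: Avro stores data row by row, while Parquet stores it column by column. The sketch below shows that difference with plain Python structures rather than the real file formats; the sample records and function names are hypothetical, and it assumes every record has the same fields.

```python
# Row-oriented layout: one complete record after another.
rows = [
    {"device": "d1", "temp": 20.0},
    {"device": "d2", "temp": 22.0},
    {"device": "d3", "temp": 24.0},
]


def to_columns(records):
    """Pivot row-oriented records into a column-oriented layout."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def column_mean(columns, name):
    """Aggregate one column without touching any of the others."""
    values = columns[name]
    return sum(values) / len(values)


if __name__ == "__main__":
    columns = to_columns(rows)
    print(columns)
    print(column_mean(columns, "temp"))
```

A row layout suits writing and reading whole records, as in ingest; a column layout suits analytic queries that scan a few fields across many records.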


Data Processing: Beyond MapReduce

Processing: Pig, Hive, MapReduce, Spark

Workload Management (YARN)

Hadoop Distributed File System (HDFS)

File formats and security: Avro, Parquet, Encryption on Disk

Ingest: Sqoop, Flume, Kafka, Spark Streaming

Data sources: RDBMS, Streaming Event Data (sensors, logs, devices)


Data Analysis and Exploration

Processing and analysis: Pig, Hive, MapReduce, Spark, Impala (SQL), Search (Solr)

Workload Management (YARN)

Hadoop Distributed File System (HDFS)

File formats and security: Avro, Parquet, Encryption on Disk

Ingest: Sqoop, Solr indexing, Flume, Kafka, Spark Streaming

Data sources: RDBMS, Streaming Event Data (sensors, logs, devices)


NoSQL Storage Option

Processing and analysis: Pig, Hive, MapReduce, Spark, Impala (SQL), Search (Solr), NoSQL (HBase)

Workload Management (YARN)

Hadoop Distributed File System (HDFS)

File formats and security: Avro, Parquet, Encryption on Disk

Ingest: Sqoop, Solr indexing, Flume, Kafka, Spark Streaming

Data sources: RDBMS, Streaming Event Data (sensors, logs, devices)


Completing the Picture: Security, Workflow, UIs
Security: Sentry

UI: Hue

Workflow: Oozie

Management and governance: Cloudera Manager, Cloudera Navigator

Processing and analysis: Pig, Hive, MapReduce, Spark, Impala (SQL), Search (Solr), NoSQL (HBase)

Workload Management (YARN)

Hadoop Distributed File System (HDFS)

File formats and security: Avro, Parquet, Encryption on Disk

Ingest: Sqoop, Solr indexing, Flume, Kafka, Spark Streaming

Data sources: RDBMS, Streaming Event Data (sensors, logs, devices)


Cloudera at a Glance
Founded: 2008, by former employees of
Leadership team: Tom Reilly (CEO), Mike Olson (Chief Strategy Officer), Amr Awadallah (CTO), Doug Cutting (Chief Architect and Creator of Hadoop)
Employees today: 1,000+ worldwide
Global operations: More than 20 countries and more than 100 24x7 global staff
The largest ecosystem: More than 1,800 partners
Cloudera University: Over 40,000 trained
Mission critical: Production-level deployments across multiple industries globally
Open source leaders: Contributors to the Apache Hadoop ecosystem of projects


Cloudera Enterprise
Making Hadoop Fast, Easy, and Secure

A new kind of data platform:
• One place for unlimited data
• Unified, multi-framework data access

Cloudera makes it:
• Fast for business
• Easy to manage
• Secure without compromise


Why Cloudera?
Your trusted partner for getting results with enterprise Hadoop.

• Open Source Innovation
No one knows Hadoop better than Cloudera. Cloudera leads development of enterprise Hadoop and offers the best support, training, and services.

• Powerful Enterprise Tools
Cloudera extends open source Hadoop with capabilities required by the largest enterprises.

• Ecosystem
Cloudera partners with industry leaders to ensure Hadoop works with the platforms, tools, and integrators our customers rely on.

• Enterprise Security
Meet compliance requirements and reduce risk exposure from storing sensitive data.

• Data Governance
Enable compliance and maximize analyst productivity.

• Complete Management
Deliver optimum system utilization and meet SLA commitments, on-premises or in the cloud, with minimum effort.


The Most Complete Partner Ecosystem
More than 1,800 partners ensure compatibility with existing investments, lower skill barriers, and help maximize value from your data.

Partner categories: Applications, Operational Tools, Data Systems, Infrastructure
Enterprise Data Hub: Process, Discover, Model, Serve; Security and Administration; Unlimited Storage


Sustainable Innovation
A Hybrid Open Source Model combining the power of open source with the enterprise capabilities customers need.

Open Platform: 100% Open Source & Open Standards
• Deep open source commitment
• 2/3 of engineering on open source
• 19 Hadoop ecosystem projects founded
• 90 ASF committer seats, 67 PMC seats
• Enterprise-ready extensions
• Security, governance, and system management
• Comprehensive partner integrations
• 160+ certified solutions


Best-In-Class Support

8.9: Overall satisfaction makes Cloudera the industry benchmark for support

95%: Customers agree they benefit from Cloudera technical support outreach

#1: Ability to solve technical issues is the top reason to recommend Cloudera for Hadoop


Industry-Leading Training and University Programs

Big Data professionals from 60% of the Fortune 100 have attended live Cloudera training

Cloudera has trained over 40,000 people on Hadoop since 2009

Source: Fortune, “Fortune 500” and “Global 500,” May 2012.


Our Platform



What is Cloudera Enterprise Data Hub?

A unified platform that can collect, store, process, and analyze unlimited amounts and types of data at linearly predictable scale, powered by Apache Hadoop and Apache Spark


Cloudera Enterprise
Making Hadoop Fast, Easy, and Secure

Hadoop delivers:
• One place for unlimited data
• Unified, multi-framework data access

Cloudera delivers:
• Leading performance
• Easy system management
• Compliance-ready security

Platform layers:
• Process (Batch, Stream), Discover (SQL, Search), Model (Analytics, ML), Serve (NoSQL)
• Security, Governance, Administration
• Unlimited Storage
• Deployment Flexibility: On-Premises, Appliances, Engineered Systems, Public Cloud, Hybrid Cloud, Private Cloud


From Hadoop to an Enterprise Data Hub
Platform attributes compared: Open Source, Scalable, Flexible, Cost-Effective, Managed, Open Architecture, Secure and Governed

CLOUDERA’S ENTERPRISE DATA HUB
• Batch Processing: MapReduce
• Analytic SQL: Impala
• Search Engine: Solr
• Machine Learning: Spark
• Stream Processing: Spark Streaming
• 3rd Party Apps
• Workload Management: YARN
• Storage for any type of data (unified, elastic, resilient, secure): Filesystem (HDFS), Online NoSQL (HBase)
• Security: Sentry
• Data Management: Cloudera Navigator
• System Management: Cloudera Manager


The Only Complete Hadoop Management Suite
Deliver optimum system utilization and meet SLA commitments.

Cloudera Manager
Focus on the solution, not the cluster, with the only complete, zero-downtime administration tool for Apache Hadoop.

Unique Capabilities:
• Unified configuration, management and monitoring across all services
• Online installation and upgrades
• Direct connection to Cloudera Support
• 3rd Party Extensibility


The Only Portable Cloud Experience for Hadoop
Maximize flexibility in Hadoop deployment architectures.

Cloudera Director
The first portable, self-service solution for deploying and managing enterprise-grade Hadoop in the Cloud.

Unique Capabilities:
• Dynamic cluster lifecycle management
• Cloud blueprints
• Multi-cluster health visibility
• Usage reporting for billing models



The Only Hadoop Data Governance Solution
Enable compliance and maximize analyst productivity.

Cloudera Navigator
Minimize risk and maintain compliance with the only native end-to-end data governance solution for Apache Hadoop.

Unique Capabilities:
• Auditing
• Lineage
• Metadata Tagging and Discovery
• Lifecycle Management



Adaptive Data Model Management
Improve DBA productivity through continuous optimization.

Navigator Optimizer
Instantly understand data warehouse and Hadoop cluster usage, and drive optimizations to reduce cost and improve performance.

Unique Capabilities:
• Schema and workload profiling
• Data model discovery
• Optimization guidance
• Optimization automation (future)


The Only Comprehensively Secure Hadoop Platform
Meet compliance requirements and reduce risk exposure from storing sensitive data.

1. Perimeter: Standards-based Authentication
2. Access: Unified Role-based Authorization
3. Visibility: Auditing & Governance
4. Data: Encryption & Key Management

Cloudera is the leader in Hadoop security.

Unique Capabilities:
• Comprehensive and Unified
• Secure at the core
• No Performance Impact
• Jointly engineered with Intel
• Compliance-Ready
• Only distribution to pass PCI audit
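
The "Unified Role-based Authorization" layer can be illustrated with a small lookup: privileges are granted to roles, roles to groups, and users inherit access through their groups. The sketch below is loosely modeled on that idea and is not the Sentry API; every user, group, role, database, and table name is made up for the example.

```python
# Privileges are (database, table, action) tuples granted to roles.
ROLE_PRIVILEGES = {
    "analyst": {("sales", "orders", "SELECT")},
    "etl": {("sales", "orders", "SELECT"), ("sales", "orders", "INSERT")},
}
GROUP_ROLES = {"bi_team": {"analyst"}, "data_eng": {"etl"}}
USER_GROUPS = {"alice": {"bi_team"}, "bob": {"data_eng"}}


def is_allowed(user, database, table, action):
    """Return True if any role reachable through the user's groups
    carries the requested privilege; deny by default."""
    for group in USER_GROUPS.get(user, ()):
        for role in GROUP_ROLES.get(group, ()):
            if (database, table, action) in ROLE_PRIVILEGES.get(role, ()):
                return True
    return False


if __name__ == "__main__":
    print(is_allowed("alice", "sales", "orders", "SELECT"))
    print(is_allowed("alice", "sales", "orders", "INSERT"))
```

Granting privileges to roles rather than to individual users means access changes when group membership changes, with no per-user edits.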


MasterCard
Cloudera: The first PCI-Certified Hadoop Platform

Challenge: All applications, databases, or file systems that have the potential to handle personal account-related data must undergo full PCI certification

Solution: MasterCard’s Cloudera environment fully conforms to the PCI-DSS V 2.0 security standards so it can host PCI datasets and potentially integrate with other internal systems

“Data privacy and protection is a top priority for MasterCard. As we maximize the most advanced technologies from partners and vendors, they must meet the rigorous security standards we’ve set. With Cloudera’s commitment to the same standards, we now have additional options in how we manage our data center.”
– Gary VonderHaar, Chief Technology Officer, Architecture, MasterCard


Thank You!
