Richard McDougall (@richardmcdougll)
CTO, Application Infrastructure; Big Data Lead, VMware, Inc.
Build
Private Public
Run
Manage
Simplify Data
Audio, digital TV, digital photos, camera phones, RFID, medical imaging, sensors, satellite images, games, scanners, Twitter, CAD/CAM, appliances, videoconferencing, digital movies
Value from the intelligence of data analytics now outstrips the cost of hardware
Hadoop enables the use of 10x lower cost hardware
Hardware cost halving every 18 months
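The "halving every 18 months" claim can be read as a simple exponential decay. The sketch below is only an illustration of that arithmetic: the function name, the starting cost of 100, and the assumption that the halving is smooth and continues indefinitely are not from the slides.

```python
def projected_cost(initial_cost: float, months: float,
                   halving_period_months: float = 18.0) -> float:
    """Cost after `months`, assuming it halves every `halving_period_months`.

    The 18-month default is the figure quoted on the slide; treating the
    decline as a smooth exponential is an assumption made for illustration.
    """
    return initial_cost * 0.5 ** (months / halving_period_months)


if __name__ == "__main__":
    # Hypothetical starting cost of 100 units, sampled at each halving period.
    for months in (0, 18, 36, 54):
        print(months, projected_cost(100.0, months))
```

Under this reading, hardware that costs 100 units today costs 25 units three years out, which is the trend behind the value-versus-cost argument.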
Value
Real-Time Processing (S4, Storm)
Big SQL (Greenplum, Aster Data, etc.)
Batch Processing
Scale of Cluster
100s
File System:
100s PB
1,000s
Yes
Hadoop
Big-SQL:
PBs
100s
No
Cassandra, hBase,
In-Memory:
100s
Future
10s-100s
Hybrid Possible
Analytics Tools
Developer Frameworks
HDFS Greenplum
Database/DataStore
Data Platform
Data PaaS
vSphere
Cloud Infrastructure
Private Public
Goals
Make it fast and easy to provision new data clusters on demand
Allow mixing of workloads
Leveraging Virtualization
Elastic scale
Use high availability to protect key services, e.g., Hadoop's NameNode/JobTracker
Simplify
Single Hardware Infrastructure
Faster/Easier provisioning
SQL Cluster
Big SQL
NoSQL Cluster
NoSQL
Hadoop
Public
Optimize
Shared Resources = higher utilization
Decision Support Cluster
Local Storage
$0.05/Gigabyte
$1M gets:
20 Petabytes
10,000,000 IOPS
800 Gbytes/sec
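The capacity figure follows directly from the quoted price: $1M at $0.05 per gigabyte buys 20,000,000 GB, which is 20 PB in decimal units. A minimal sketch of that arithmetic is below; the function name and the choice of decimal units (1 PB = 1,000,000 GB) are assumptions. The IOPS and bandwidth figures depend on the number and type of drives, which the slide does not state, so they are not derived here.

```python
GB_PER_PB = 1_000_000  # decimal units, assumed: 1 PB = 1,000,000 GB


def capacity_gb(budget_dollars: int, cents_per_gb: int) -> int:
    """Whole gigabytes a budget buys at a given price per gigabyte.

    Prices are passed in integer cents so the division is exact and
    not subject to floating-point rounding.
    """
    return (budget_dollars * 100) // cents_per_gb


if __name__ == "__main__":
    gb = capacity_gb(1_000_000, 5)  # $1M at $0.05/GB, as on the slide
    print(gb, "GB =", gb // GB_PER_PB, "PB")
```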
Hybrid Storage
SAN for boot images, VMs, other workloads
[Figure: Hadoop instances running across hosts]
[Figure: Hadoop, ratio to native]
Filesystem
Large-Scale NoSQL
In-Memory
Big SQL
Cloud Infrastructure
Management
Data Discovery
Cloud Infrastructure
Existing Applications
New Applications
Provisioning
Clone
Resource Mgmt
Security Mgmt
Database Templates
Monitor
VMware vSphere
Unstructured
Structured
Filesystem
Cloud Infrastructure
Large-Scale NoSQL
In-Memory
Big SQL
Filesystem
Types of Data: Log files, machine-generated data, documents, device data, etc.
Technologies: NAS, HDFS, Blob (S3, Atmos, etc.)
Values: Store any data, easy to scale out, can optimize for cost
Large-Scale NoSQL
Types of Data: Loosely typed device data, records, events, statistics, complex relations/graphs
Technologies: Cassandra, HBase, Voldemort
Values: Easy to scale out, flexible and dynamic schemas
In-Memory
Big SQL
Types of Data: Structured data
Technologies: Greenplum, Sybase IQ, Aster Data, etc.
Values: High performance for repetitive queries. Ease of query language.
Cloud Infrastructure
Platform as a Service
NoSQL Integration
Spring Data for MongoDB, GemFire, Riak, Neo4j, Blob, Cassandra
Spring Hadoop
Announced this week at Strata!
Provides support for developing applications based on Hadoop technologies by leveraging the capabilities of the Spring ecosystem.
Spring Batch
Integration allows Hadoop jobs and HDFS operations to run as part of a workflow
Analytics Tools
Developer Frameworks
HDFS Greenplum
Database/DataStore
Data Platform
Data PaaS
vSphere
Cloud Infrastructure
Private Public
Summary
Hadoop on Virtualization
Proven performance
Cloud/Virtualization values apparent for Hadoop use
References
Twitter
@richardmcdougll
My CTO Blog
http://communities.vmware.com/community/vmtn/cto/cloud
Hadoop on vSphere
Talk @ Hadoop World
Performance Paper
http://www.vmware.com/files/.../VMW-Hadoop-Performance-vSphere5.pdf
Spring Hadoop
http://blog.springsource.org/2012/02/29/introducing-spring-hadoop