
Building a Business on Open Source

Distributed Computing

company: www.visibletechnologies.com

blog: www.roadtofailure.com
twitter: @lusciouspear

Sunday, December 20, 2009


Social Media and Scaling

•Scalability Matters Now.
•Social media (SM) produces large, complex data
•Anyone can collect the web
•Make a Twitter in a few days
•Easy to get TBs of data
•Big Data is enabling new fields for companies

What Visible Does

•BI and Brand Management on Social Media

•Listen, Monitor, Engage

Old Product: RDBMS

•A few MSSQL servers on boxes


•Lots of ETL
•Several TB of data: slow inserts, impossible deletes, random failures

Why RDBMS Bad
•Nonlinear scale cost
•Used as a storage abstraction
•Mainly Select, Join, Group, Count
•Specialized Scale-Out ones ‘meh’
•Impedance Mismatch - tries to be both High-Throughput and Low-Latency
•Swiss-army knife, unstable, transactions, advanced SQL, tuning

Why OSS?

•Previously an all-Microsoft (MS) shop
•It exists!
•Scaling + Licensing = No
•Can’t build a platform without source
•It’s Enterprise Now!

Goals for New Platform

•“Golden Timeline”
•Search/Analyze *any* data
•Linear Cost
•Not Hacked Together
•“Collect the Social Internet”

HOW TO SCALE

•What makes you special?


•What are you willing to sacrifice?
•How will you structure the data?

“As I researched scaling, I

Avoiding Impedance Mismatch

•Most problems can be divided into High or Low latency
•Get a lot of data eventually, or a little now

•MapReduce vs. Sharding/Indexing
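
The split between "a lot of data eventually" and "a little now" can be illustrated with a minimal Python sketch. Everything in it (the sample posts, the function names) is hypothetical and only stands in for the real systems: the batch function scans every record the way a MapReduce job would, while the index pays its cost up front so that each lookup touches very little data.

```python
from collections import defaultdict

# Hypothetical sample records: (author, post body). Not data from the talk.
POSTS = [
    ("alice", "hadoop makes batch jobs easy"),
    ("bob", "sharding keeps lookups fast"),
    ("alice", "hbase sits on top of hadoop"),
]

def batch_count_by_author(posts):
    """High-latency path: scan every record to answer one big question."""
    counts = defaultdict(int)
    for author, _body in posts:
        counts[author] += 1
    return dict(counts)

def build_author_index(posts):
    """Low-latency path: build an index once so each lookup is a single key access."""
    index = defaultdict(list)
    for author, body in posts:
        index[author].append(body)
    return dict(index)

author_index = build_author_index(POSTS)

print(batch_count_by_author(POSTS))  # full scan, answer covers all data
print(author_index["bob"])           # one lookup, answer covers one key
```

The batch path gets slower as the data grows but answers broad questions; the indexed path stays fast per request but only answers the questions the index was built for.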



Ecosystem
[Diagram: layered stack]
•Katta / Applications
•Compiled Processing: Pig, Cascading, Hive
•Raw Processing: MapReduce
•Structured Storage: HBase
•Unstructured Storage: Hadoop DFS
•Zookeeper


Simple Workflow
[Diagram: workflow stages grouped by platform]
•Hadoop: Collect, Semantic Analysis, Unstructured Analysis
•Hadoop + HBase: Structured Analysis, Store in HBase, Indexing, Store in Hadoop
•Lucene + Solr + Katta: Pull Indexes, Load/Replicate Shards, Search


Unstructured Processing Cluster
[Diagram labels: Internet, Collect, Semantic Analysis, Unstructured Analysis, Store, Structured]
[Data labels: HTML, XML, HBase Records]


Hadoop + MR

•Special: Crunch web-scale data fast


•Sacrifice: Low-Latency, Transactions, Random Access, Updates

•Structure: Chunked flat files
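
As a rough illustration of the MapReduce model over chunked flat files, here is a single-process Python sketch of a word count. It does not use Hadoop itself; `CHUNKS`, `map_phase`, `shuffle`, and `reduce_phase` are hypothetical names chosen for this example, and the sample lines are made up.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical input: each inner list stands in for one chunk of a flat file.
CHUNKS = [
    ["iron maiden rules", "maiden live"],
    ["iron man"],
]

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Collapse all values for one key into a single total."""
    return key, sum(values)

def word_count(chunks):
    """Run map, shuffle, and reduce over every line of every chunk."""
    mapped = chain.from_iterable(
        map_phase(line) for chunk in chunks for line in chunk
    )
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

print(word_count(CHUNKS))
```

Each chunk can be mapped independently, which is what lets the real system spread work across machines; the price is that the job reads everything and offers no fast path to a single record.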



Structured Processing Cluster
[Diagram labels: Unstructured Cluster, Structured Analysis, Store in HBase, Indexing, Store in Hadoop, Search Cluster]
[Data labels: Enriched Data, HBase Records, Sharded Lucene Index, Lucene Index]


Document Structure

ContentID: 00BAC189
Title: Iron Maiden Rules
Body: I think Janick Gers is an amazing guitarist blah blah
PostDT: 20090718
ParentID: 0FDEADBEEF
Permalink: www.roadtofailure.com/post?=20
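
One possible way to model this record in code is a small Python dataclass. This is a sketch, not the platform's actual schema: the class name, the snake_case field names, and the helper `parse_post_dt` are hypothetical, and reading `PostDT` as a `YYYYMMDD` date is an assumption based on the sample value.

```python
from dataclasses import dataclass
from datetime import date, datetime

@dataclass(frozen=True)
class Document:
    """Hypothetical in-memory form of the record shown on the slide."""
    content_id: str
    title: str
    body: str
    post_dt: date
    parent_id: str
    permalink: str

def parse_post_dt(raw: str) -> date:
    """Parse PostDT, assuming it is formatted as YYYYMMDD."""
    return datetime.strptime(raw, "%Y%m%d").date()

doc = Document(
    content_id="00BAC189",
    title="Iron Maiden Rules",
    body="I think Janick Gers is an amazing guitarist blah blah",
    post_dt=parse_post_dt("20090718"),
    parent_id="0FDEADBEEF",
    permalink="www.roadtofailure.com/post?=20",
)

print(doc.title, doc.post_dt.isoformat())
```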

HBase

•Special: Scalable random/sequential access almost as fast as RDBMS
•Sacrifice: Joins, Secondary Indexes, Transactions (kind of)

•Structure: BigTable - column oriented
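
The BigTable-style layout can be pictured as sorted row keys, each pointing at a sparse set of `family:qualifier` columns. The toy Python class below is an in-memory illustration only and is not the HBase API; the class name, the column names, and the second row key are hypothetical.

```python
import bisect

class TinyColumnStore:
    """Toy stand-in for a BigTable-style table (not the HBase client API)."""

    def __init__(self):
        self._row_keys = []  # kept sorted so range scans are cheap
        self._rows = {}      # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        """Write one cell; rows are created on first write."""
        if row_key not in self._rows:
            bisect.insort(self._row_keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key, column=None):
        """Random access by row key: one cell, or the whole row as a dict."""
        row = self._rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)

    def scan(self, start_key, stop_key):
        """Sequential access: yield rows with start_key <= key < stop_key, in key order."""
        lo = bisect.bisect_left(self._row_keys, start_key)
        hi = bisect.bisect_left(self._row_keys, stop_key)
        for key in self._row_keys[lo:hi]:
            yield key, dict(self._rows[key])

table = TinyColumnStore()
table.put("00BAC189", "content:title", "Iron Maiden Rules")
table.put("00BAC189", "meta:postdt", "20090718")
table.put("00BAC18A", "content:title", "Another post")

print(table.get("00BAC189", "content:title"))
print([key for key, _row in table.scan("00BAC189", "00BAC18B")])
```

Note what is missing: there is no way to look up a row by title without scanning, and nothing joins two tables. Those are the sacrifices listed above, and the row key design has to carry the access pattern.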



Search Cluster
[Diagram labels: Lucene Indexes from HDFS, Pull Indexes, Load/Replicate Shards, Search]
[Data labels: Lucene Indexes (two boxes)]


Search

Katta + Solr

•Special: Sharded search


•Sacrifice: Consistency, high-throughput
•Structure: Inverted (reverse) index
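
To make "sharded search" over a reverse (inverted) index concrete, here is a small pure-Python sketch. It does not use Lucene, Solr, or Katta; the class names, the modulo routing rule, and the sample documents are all assumptions made for illustration.

```python
from collections import defaultdict

class Shard:
    """One index shard: maps each term to the set of document IDs containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term):
        # .get avoids creating empty entries for terms that were never indexed.
        return set(self.postings.get(term.lower(), set()))

class ShardedIndex:
    """Routes documents to shards and merges per-shard search results."""

    def __init__(self, num_shards):
        self.shards = [Shard() for _ in range(num_shards)]

    def add(self, doc_id, text):
        # Hypothetical routing rule: integer document ID modulo shard count.
        self.shards[doc_id % len(self.shards)].add(doc_id, text)

    def search(self, term):
        # Scatter the query to every shard, then gather and merge the hits.
        hits = set()
        for shard in self.shards:
            hits |= shard.search(term)
        return sorted(hits)

index = ShardedIndex(num_shards=2)
index.add(1, "Iron Maiden rules")
index.add(2, "Maiden tour dates")
index.add(3, "Hadoop cluster rules")

print(index.search("maiden"))
print(index.search("rules"))
```

Every query fans out to all shards, so adding shards spreads the index but not the query load, and a document added to one shard is invisible until that shard is reloaded or replicated. That is one way to read the consistency and throughput sacrifices above.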

BI

•Group, Sort, Filter, Count, Sum


•Semi-additive (Avg) rare but not hard
•MapReduce Jobs
•Faceted Search
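
The reason an average is "rare but not hard" can be sketched in Python: counts and sums merge directly across partitions, while an average has to be carried as a (sum, count) pair and divided only at the end. The sample rows and function names below are hypothetical.

```python
# Hypothetical fact rows: (brand, sentiment score). Not data from the talk.
ROWS = [("acme", 4), ("acme", 2), ("globex", 5), ("acme", 3)]

def map_row(row):
    """Emit (group key, (sum, count)) for one fact row."""
    brand, score = row
    return brand, (score, 1)

def combine(a, b):
    """Merge two partial (sum, count) pairs; safe to apply in any order."""
    return a[0] + b[0], a[1] + b[1]

def partial_aggregate(rows):
    """Reduce one partition of rows to {key: (sum, count)}."""
    partials = {}
    for key, value in map(map_row, rows):
        partials[key] = combine(partials[key], value) if key in partials else value
    return partials

def merge_partials(left, right):
    """Merge the partial results of two partitions."""
    merged = dict(left)
    for key, value in right.items():
        merged[key] = combine(merged[key], value) if key in merged else value
    return merged

def finalize(partials):
    """Turn (sum, count) pairs into count, sum, and average per group."""
    return {k: {"count": n, "sum": s, "avg": s / n} for k, (s, n) in partials.items()}

# Aggregate two partitions separately, then merge, as parallel workers would.
merged = merge_partials(partial_aggregate(ROWS[:2]), partial_aggregate(ROWS[2:]))
print(finalize(merged))
```

Averaging the per-partition averages would give the wrong answer whenever partitions differ in size, which is why the division is deferred to `finalize`.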



Examples

Challenges

•Scaling Search
•Understanding Latency
•What do we need ‘now’? Can customers wait for big data?

•Monitoring

Recap: Rules for Scaling

•RDBMS is not a Swiss-Army Knife


•Know your sacrifices
•Know your specialness
•Know your data structure
•Ponder Latency

What Next?

•HBase Analytics?
•“What would make a bank trust it”
•Teach people to think about data



The End

company: www.visibletechnologies.com

blog: www.roadtofailure.com
twitter: @lusciouspear

bradfordstephens@gmail.com
