
5/24/2016

Architecting Big Data Solutions
Use Cases and Strategies

Big Data Overview

Unauthorized copying, distribution and exhibition of this presentation is punishable under law
Copyright @2016 V2 Maestros, All rights reserved.

Traditional Data Solutions

Traditional Data Characteristics
• Numbers
• Generated by applications like Finance, Sales, Payroll
• Well defined schema
• Pre-defined linking
• The data attributes hardly change
• Reside within an enterprise
• Centralized data repository
• Offline backups

Traditional Data Processing
• Small “distances” between source and sink – instantaneous transfers
  • UI to Database
  • Database to Data Processor to Database
  • Database to Reporting
• Data moves to the application code for processing
• Data validation at the source (no incomplete / dirty data)
• RDBMS built for number crunching
• Pre-summarized and computed data
• Reporting is primarily pre-canned

Traditional Solutions Architectures
• Single Centralized Data Store
• 3-Tier architectures
  • Presentation Layer (UI)
  • Business Layer (Backend)
  • Data Layer
• Monolith code, either home grown or bought
• Integration through custom Interfaces
• Changes require full life-cycle projects


Traditional Data Challenges
• Cannot handle Text Processing economically
• Cannot handle Incomplete and dirty data
• High costs for storing text data (H/W, S/W)
• Backups restoration is time-consuming
• High management / licensing costs
• Schema changes take significant time

Big Data Solutions


What is Big Data?
Gartner: Big Data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.
• Variety (Text, Video, Audio, machine data)
• Volume (Tera and Peta Bytes)
• Velocity (not under control)
• Veracity (dirty, incomplete, inaccurate)

What led to Big Data
• Cloud adoption
• Social Media
• Mobile explosion
• Machine generated Data
• Data driven management


What is a Big Data Application?
One or many of the following should be true
• Data in Tera or Peta Bytes
• More than one source / form
• Text or media data
• Huge processing loads
• Real time stream processing
• Advanced Analytics
• Changing user requirements

Characteristics of Big Data Products
• Open Source
• Open Integration technologies / APIs
• High interoperability
• Constantly evolving
• Immature
• Big Deployment footprint
• Relatively cheaper


Current Big Data Trends

Technologies
• Numerous companies / projects focused on Big Data technologies
• Mainly open source
• Cloud focused
• Focus on “one thing” with open interfaces for integrations
• Phenomenal growth in adoption spurred by startup culture
• Number of immature alternatives in each segment


Software Product Organizations
• New domains are driving new product features
  • Cloud
  • Social Media
  • Mobile
• Big Data considered necessary for cost savings
• Flexible ad-hoc analysis capabilities demand flexible schema
• Advanced Analytics being added to reporting solutions

Enterprise IT
• Curious and scared at the same time
• Mandated to look at Social / Cloud / Mobile data
• Competitive pressure to be data driven
• Wait and watch until technology is mature
• Starting off proof-of-concepts
• Moving towards the cloud for cost savings


Big Data Solution Architecture

Introduction


What is a Big Data Solution?
• Acquire and assemble “big data”
• Various formats from diverse sources
• Process and persist in scalable and flexible data stores
• Provide flexible open APIs for querying
• Provide advanced analytics capabilities
• Use Big Data technologies to “knit” the solution together rather than building from the ground up.

Traditional vs Big Data Solutions

Activity | Traditional Applications | Big Data Applications
Data Acquisition | Data entry by end users | Databases (traditional applications), machine logs, social media
Validation | Validated during entry | Done post-acquisition
Cleansing | Not required | Required for web / social media data
Transformation | Summarization | Text-to-numbers, enrichment, summarization
Persistence | Single centralized RDBMS | Distributed, polyglot persistence
Applications | 3-Tier – business layer centered | Data centered, integration oriented
Usage | Reporting, Analytics, Statutory | Analytics, Machine Learning, Predictive & Prescriptive


Historical vs Real Time

Historical | Real Time
Store-and-forward | Streaming
End-of-day or end-of-processing trigger | Event based trigger – as it happens
Completed records | Live records with updates
Full publish / republish | Delta
No loss of data | Possible loss of data
Detailed analytics | Snapshot / intraday analytics
Model building | Prediction

Architecture Template

Modules of a Big Data Solution
• Acquisition
  • Batch & Stream, multiple formats
• Transportation
  • Over internet and organizational boundaries
• Persistence
  • Polyglot
• Transformation
  • Cleansing, Linking, translating, summarizing
• Reporting
  • UIs and APIs
• Advanced analytics
  • Machine Learning, Prescriptive, Actionable
• Management

Big Data Architecture Template
[Diagram: Acquire, Transport, Persist, Transform, Persist, Analyze, Report, Manage]
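
As a rough illustration of how the modules above hand data to each other, here is a minimal Python sketch. Every function name and the sample records are hypothetical; a real solution would put a dedicated product behind each step.

```python
from collections import Counter

def acquire():
    # Hypothetical raw events standing in for logs or social media posts.
    return ["ok login", "error disk", "", "ok logout", "error net"]

def transport(records):
    # Stand-in for a store-and-forward hop: hand the batch over unchanged.
    return list(records)

def persist(store, records):
    # Keep the raw records at their lowest granularity.
    store.extend(records)
    return store

def transform(records):
    # Cleansing (drop empty lines), then translate text into a category.
    return [r.split()[0] for r in records if r.strip()]

def report(categories):
    # Summarize on the fly for a report or dashboard.
    return dict(Counter(categories))

store = []
summary = report(transform(persist(store, transport(acquire()))))
print(summary)
```

The raw records stay in `store` untouched, so the transformation can be rerun later with different rules.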


Technology Options

About Technology options
• Only popular options are discussed
• No detailed discussions – only salient features, advantages and shortcomings
• Encouraged to seek other sources for deeper learning of these technologies.


Traditional way – Build from scratch
• Monolithic applications – home grown or bought
• Single programming language – 1000s of LOC written
• Single data store
• High development / maintenance costs

Big Data way – assemble and stitch
• Big data processing has 2 common demands – scalability and reliability
• A number of products / technologies available, especially open source
• They support excellent open integrations
• Acquire most suitable components
• Stitch / integrate them to create a solution
• Minimal custom work
• Very fast to-production times


Challenges with Big Data Technologies

Too many options
• Everyone is coming up with a product
• Each product addresses a narrow specific field
• There is no one-product-fits-all
• Everyone is trying to expand to cover other use cases
• Replacement technologies are invented at a fast pace.


Immature and incomplete
• High change rate
• Field support and services are still primitive
• Need to still address administration and usability
• Shortage of skilled and experienced personnel
• Difficult to predict the future

Not future safe
• Technologies going out of vogue before the first release of the application
• Enterprises like their investments to be safe for at least 10 years
• Companies supporting most technologies are small / startups.
• Market deployment size not significant except for a few


What to expect in the next 5-10 years?
• Few products would grow and become the leaders
• Merging of products
• Fewer, more mature options
• Stable features

Making investments future safe
• Look for product and developer support
• Look at cloud options
• Adoption by leading companies / products
• Open APIs and data formats


Acquisition Module

Responsibilities
• Connect / maintain connection to the source
• Execute protocol responsibilities (reconnects, handshakes, error handling)
• Data Format conversion
• Filtering
• Local caching
• Compression
• Encryption

Source Types
• Databases / Data warehouses
• Files
• HTTP/REST
• Data Streams
• Custom applications

What to architect?
• Identifying new data
• Re-acquisition and retransmit
• Data Loss – not missing records
• Buffering at the source
• Security – source provider policies
• Privacy – policies
• Alerting / alarming for issues


Best Practices
• Involve source owners to establish good handshakes
• Identifying new data
• Identifying missing data and retransmission
• Go for reliable Open APIs
• Native APIs/Formats should be standardized as early as possible
• Real-time vs historical – consider separate channels
• Pay attention to security and privacy

Acquire - Options


SQL Query
• Traditional way of extracting data from Relational Databases
• Mature technology
• Ability to transform (joins, group by, cube) and filter data
• Indexing takes care of performance without any programmer work.

Advantages
• Extensive support by various languages / tools / products
• Very popular and mature technology
• Supports incremental fetches and filtering
• Encryption and compression supported
Use Cases
• RDBMS sources
• Apache Hive sources
Shortcomings
• Limited to the database supported by them
• Inter-organizational / web access limited by security concerns
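
The incremental fetches mentioned for SQL sources can be sketched with a "watermark": remember the highest key already acquired and ask only for rows beyond it. The example below is a minimal sketch using Python's built-in `sqlite3` module; the `orders` table and the `fetch_new` helper are made up for illustration and assume the source has a monotonically increasing key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO orders (id, amount) VALUES (?, ?)",
    [(1, 10.0), (2, 25.5), (3, 7.25)],
)

def fetch_new(conn, last_id):
    """Return rows with id greater than last_id, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, amount FROM orders WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()
    watermark = rows[-1][0] if rows else last_id
    return rows, watermark

# First pull acquires everything; the second pull sees only the new row.
first_batch, mark = fetch_new(conn, 0)
conn.execute("INSERT INTO orders (id, amount) VALUES (4, 99.0)")
second_batch, mark = fetch_new(conn, mark)
print(second_batch, mark)
```

A timestamp column can serve as the watermark instead of an id, provided the source guarantees it never moves backwards.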


Files
• Simple and common way of exchanging / moving data
• Data converted to files (CSV, TSV, XML, JSON) for data movement
• Media, text files can be easily stored in files

Advantages
• All systems / applications provide file based output
• Work easily across inter-organizational boundaries
• Common tools / command utilities can achieve results
• Secure data encryption / compression
Use Cases
• Inter-organizational data
• Media files
Shortcomings
• Slow
• Too many manual steps / stop points
• Data exposed unless properly encrypted


REST APIs
• A web-based API standard for exchanging data and performing CRUD operations
• Decouples consumers from producers
• Stateless existence
• Uniform interface across sources: GET, POST, PUT, DELETE
• Support advanced security (OAuth2) and encryption
• Support by most cloud and mobile data sources (Twitter, Facebook, Salesforce etc.)

Advantages
• Standard for Internet Data exchange.
• Excellent security and scalability
• Easy to integrate
• Real time meta data
Use Cases
• Cloud / Social media data sources
• Mobile data sources
Shortcomings
• Redundant information (since stateless)
• Rate limitations by providers
• Does not support real time
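
Provider rate limits are usually handled on the acquiring side by retrying with growing pauses. The sketch below shows one common approach, exponential backoff. `RateLimited`, `fetch_with_backoff` and `fake_fetch` are hypothetical names, and the fake source stands in for a real HTTP client so the example runs without a network.

```python
import time

class RateLimited(Exception):
    """Raised by the hypothetical client when the provider rejects a call."""

def fetch_with_backoff(fetch, max_attempts=4, base_delay=0.01, sleep=time.sleep):
    # Retry the call, doubling the pause after every rate-limit rejection.
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))

calls = {"n": 0}

def fake_fetch():
    # Hypothetical source: rejects the first two calls, then answers.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimited()
    return {"status": 200, "items": ["a", "b"]}

delays = []  # record the pauses instead of really sleeping
result = fetch_with_backoff(fake_fetch, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the sketch testable; in production the default `time.sleep` would apply.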


Streaming
• Real time data subscribe / publish model
• Client subscribes to a specific topic / sub-set of data
• HTTP connection is kept “open”
• Server pushes data to client whenever new data is available
• Uses secure keys and encryption

Advantages
• Real time instantaneous data transfer
• Can send only “diffs” – small footprint
• Support by all major cloud providers (Twitter, Facebook, Salesforce)
Use Cases
• Real time sentiment analysis
• Real time reporting
• Real time actions based on user behavior
Shortcomings
• Loss of data if connection broken
• Rate limitations by providers
• Might need to be supplemented by historical data pulls
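
To make the idea of sending only “diffs” concrete, the following sketch compares two versions of a record and ships just the fields that changed. The `diff` and `apply_diff` helpers and the sample record are illustrative assumptions, not part of any provider's streaming API.

```python
def diff(old, new):
    """Return the fields that were added or changed and the keys removed."""
    missing = object()  # sentinel so that a stored None still compares correctly
    changed = {k: v for k, v in new.items() if old.get(k, missing) != v}
    removed = [k for k in old if k not in new]
    return {"set": changed, "unset": removed}

def apply_diff(state, delta):
    """Rebuild the new version on the client from its old copy plus the diff."""
    merged = {k: v for k, v in state.items() if k not in delta["unset"]}
    merged.update(delta["set"])
    return merged

before = {"likes": 10, "shares": 2, "draft": True}
after = {"likes": 11, "shares": 2}

delta = diff(before, after)          # only this small payload travels
client_copy = apply_diff(before, delta)
print(delta, client_copy)
```

If a diff is lost because the connection broke, the client copy silently drifts, which is why a periodic full historical pull is a useful supplement.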


Transport Module

Types
• Store and Forward
  • Receive from a source “at rest”
  • Move data in units
  • Track completion
  • Retransmit if required
• Streaming
  • Continuous moving data stream
  • Throttle at source
  • Throttle for sink
  • Inflight storage
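
The store-and-forward steps above (move data in units, track completion, retransmit if required) can be sketched as follows. This is a toy model under stated assumptions: `split_units`, `send` and `flaky_deliver` are hypothetical helpers, and the in-memory "sink" deliberately drops one unit once to trigger a retransmit.

```python
import hashlib

def split_units(payload, size):
    # Cut the data at rest into fixed-size units.
    return [payload[i:i + size] for i in range(0, len(payload), size)]

def send(units, deliver, max_rounds=3):
    """Send every unit, track acknowledgements, resend what is still pending."""
    pending = set(range(len(units)))
    for _ in range(max_rounds):
        for index in sorted(pending):
            digest = hashlib.sha256(units[index]).hexdigest()
            if deliver(index, units[index], digest):
                pending.discard(index)
        if not pending:
            break
    return pending  # empty set means the transfer completed

received = {}
dropped_once = set()

def flaky_deliver(index, data, digest):
    # Hypothetical sink: loses unit 1 on the first attempt.
    if index == 1 and index not in dropped_once:
        dropped_once.add(index)
        return False
    if hashlib.sha256(data).hexdigest() != digest:
        return False  # integrity check failed, ask for a resend
    received[index] = data
    return True

payload = b"store-and-forward moves data at rest in units"
units = split_units(payload, 8)
unsent = send(units, flaky_deliver)
rebuilt = b"".join(received[i] for i in sorted(received))
print(unsent, rebuilt == payload)
```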

Responsibilities
• Maintain link with acquisition module
• Translate data to protocol optimal formats
• Move data
• Secure Data
• Maintain link with Persistence module
• Save data in Persistence module (and confirm)
• Track data as it moves
• Re-transport in case of failures
• Reporting

What to architect
• Speed
• Throttling
• Reliability of data (no loss in transport)
• Redundancy
• Scalability
• Status Reporting and alarming
• Compression
• Encryption


Best Practices
• Do not reinvent the wheel
• Piggy back on proven messaging/transfer frameworks and protocols.
• Look for integrations between transport technologies with acquisition, transformation and persistence technologies.
• Be aware of data transport costs
• Use reliable inflight storage
• Consider security measures to prevent data theft.

Transport - Options


File Move / Copy
• Simplest way of moving large files
• Supported on all operating systems
• Inter-operating system transfers would require adapters
• Can be quickly scheduled / automated

Advantages
• Simple and straightforward to use
• No special skills
Use Cases
• Intra-enterprise
• Media Files
Shortcomings
• Inter-O/S moves require adapters
• WAN moves might have reliability issues
• Management of large files and file sets difficult


SFTP (SSH File Transfer Protocol)
• Network Protocol for File Access and Transfer
• Uses a secure channel for data protection (SSH)
• Authentication / authorization (SSH)
• Data integrity checks
• Can resume interrupted transfers
• Files carry basic source attributes like timestamps
• Wide support across O/S, tools and utilities

Advantages
• Wide Support across O/S, tools and utilities
• Mature and widely accepted
• Data security across internet / VPN / WAN
Use Cases
• Inter-enterprise file sharing
• Media file transfers
• Log files
Shortcomings
• Firewalls need to be enabled for SFTP
• Passwords need to be shared between parties
• Slower transfer speeds due to encryption


Apache Sqoop
• A Command Line tool for transferring data between relational databases and Apache Hadoop
• Entire databases, tables or results of SQL
• Support for Avro, Sequence, Parquet or Plain text files
• Hive and HBase Support
• Parallelism
• Incremental transfers
• BLOB support

Advantages
• Simple and straightforward to use
• Parallelism to speed up transfers
• Bi-directional
Use Cases
• Hadoop based backups and data warehouses
• Move data to HBase/Hive
• Analyzed data from Hadoop to RDBMS
Shortcomings
• Predominantly Linux
• Open security
• No Transformation support
• No streaming


Apache Flume
• A distributed Service for collecting, aggregating and moving large amounts of log / streaming data
• Each origin has a source component to get events
• A Channel used to transport data
• A sink where the event is deposited
• Sources can span a large number of servers
• Support for multiple sources (files, strings, HTTP POSTs, twitter streams)
• Support for multiple sink types
• Can add custom sources and sinks through code
• Robust, fault tolerant, has throttling, failover and recovery capabilities
• Inflight data processing through interceptors

Advantages
• Configuration driven
• Massively scalable
• Customizable through code
Use Cases
• Web log shipping
• Twitter streaming
• Edge server events processing
Shortcomings
• No ordering guarantees
• Duplicates possible
• No replication

Apache Kafka
• An open source message broker platform for real time data feeds
• Publish – subscribe architecture.
• Developed at LinkedIn, written in Scala.
• Topics are published. Multiple subscribers can be there for a topic
• Ordering guarantees (within a partition)
• Coding required for the publisher and the subscriber to interface with Kafka
• Replication and high availability

Advantages
• Highly scalable real time messaging system
• Multiple subscribers
• Ordering guarantees
Use Cases
• Real time analytics
• Operational metrics aggregation
• Complex event processing
Shortcomings
• Coding required for publishers and subscribers
• Support (compared to Flume)
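
The publish – subscribe model with multiple subscribers and ordered delivery can be illustrated with a small in-memory stand-in. To be clear, `ToyBroker` below is a teaching sketch and not the Kafka client API: it keeps one ordered log per topic and one read position per subscriber, which is the core idea behind independent consumers of the same topic.

```python
from collections import defaultdict

class ToyBroker:
    """In-memory stand-in for a topic-based broker (not the Kafka API)."""

    def __init__(self):
        self.logs = defaultdict(list)    # topic -> messages in publish order
        self.offsets = defaultdict(int)  # (topic, subscriber) -> next index

    def publish(self, topic, message):
        self.logs[topic].append(message)

    def poll(self, topic, subscriber):
        # Return everything this subscriber has not seen yet, in order.
        start = self.offsets[(topic, subscriber)]
        batch = self.logs[topic][start:]
        self.offsets[(topic, subscriber)] = start + len(batch)
        return batch

broker = ToyBroker()
for event in ["click:1", "click:2", "click:3"]:
    broker.publish("clicks", event)

analytics = broker.poll("clicks", "analytics")
broker.publish("clicks", "click:4")
metrics = broker.poll("clicks", "metrics")          # late subscriber sees all
analytics_next = broker.poll("clicks", "analytics")  # only the new message
print(analytics, metrics, analytics_next)
```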


Persistence Module

Responsibilities
• Reliable data storage
• ACID (Atomicity, Consistency, Isolation, Durability)
• Schema
• Transactional support
• Data Access
• Response times
• Scalability (multi-cluster, shared-nothing etc.)


What to architect
• Scalability
• Consistency
• Transactions
• Read-intensive vs write-intensive
• Mutable vs immutable data
• Cataloging (meta-data)
• Latency: real time vs historical
• Standard vs adhoc loads
• Flexible Schema

Best Practices
• Horses for Courses / polyglot persistence
• Keep schema / design flexible
• Keep data at lowest granularity
• Summarize only if needed
• Consider real time application needs
• Do away with backups


Persist - Options

RDBMS
• Still has a Big role in Big Data Architectures
• Stores data in Tables and Columns
• Optimized for numbers
• Excellent Query performance
• Limitations in scalability.
• Schema needs to be predefined
• Few mature options – Oracle, MySQL, PostgreSQL


RDBMS
Advantages
• Very mature technology
• Query Performance
• 3rd Party / Tool support
• Excellent ACID support
Use Cases
• Meta Data
• Multi Update cases
• Work In Progress Data
• Summary data
• Results
Shortcomings
• Scalability in TB/PB
• Rigid Schema
• Cost
• Text storage

HDFS
• A distributed file system that can span across hundreds of nodes
• Multiple copies of the same file eliminate the need for backups
• Can run on commodity servers
• Resilient to node failures
• Open Source Apache project
• Allows for parallel execution of Map Reduce tasks


HDFS
Advantages
• Massively scalable and reliable
• Allows parallel data processing
• No backups needed
• Very cost effective (compared to SAN)
Use Cases
• Log files
• Media files (recordings)
• Online backup for RDBMS data
Shortcomings
• No indexes. Access can be very slow
• Security concerns
• Limited to Java programming

Cassandra
• Wide Column store (Big Table)
• Open source, developed by Facebook
• Dynamic Schema
• Decentralized architecture
• Single index for each table
• Excellent single-row query performance
• Bad range scan performance
• No aggregation support


Cassandra
Advantages
• Excellent scalability and performance
• Strong security
• Multiple writes with excellent performance
Use Cases
• Customer 360
• Monitoring Statistics and analytics
• Location based lookup
Shortcomings
• Transaction support
• Adhoc querying
• No support for Group By or Joins

MongoDB
• Document Oriented Database (JSON)
• Strong consistency
• Expressive Query Language
• Multiple Indexes
• Aggregations
• Replication and failover
• Master Slave Model


MongoDB
Advantages
• Open Source
• Multiple indexes
• Strong and rich query language support
• Ease of use
Use Cases
• Store documents
• Write once read many data stores
• Real time analytics
• Possible RDBMS replacement
Shortcomings
• Transaction support
• No foreign keys
• No Joins

Neo4j
• Graph oriented database
• Deals with Relationships – nodes and edges
• ACID Compliant
• Transaction support
• Cypher – Graph Query Language
• Easy to program complex joins
• Fast relationship traversals


Neo4j
Advantages
• Transaction and ACID support
• Excellent Query support
• Fast traversal of nodes (join equivalents)
• Maps easily to OO applications
Use Cases
• Master Data Management
• IT Network modeling
• Social network modeling
• Identity and Access Management
Shortcomings
• No built in user management
• Aggregation support
• Not suitable for data without a lot of relationships

ElasticSearch
• A full text search-engine
• A distributed document store
• Every field is indexed and searchable
• Can scale to hundreds of servers for structured and unstructured data
• Aggregation support


ElasticSearch
Advantages
• Outstanding search capabilities
• Aggregation support
• Flexible schema
Use Cases
• Not recommended as primary data store
• Adhoc query building and aggregation
• Real time analytics
Shortcomings
• No ACID support
• No SQL support
• Data loss risks

Transformation Module

Responsibilities
• Cleansing
• Filtering
  • Unwanted, incomplete
• Standardization
  • Format, content
• Enrichment
  • ID-names mapping, bucketing, categorization
• Integration
  • Between data sources
• Summarization / Pivoting

What to architect?
• Real time vs historical
• Templates
• Denormalization
• Reprocessing
• Parallelism
  • Setup
  • Code
• Speed
• Work-In-Progress Storage
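
The transformation responsibilities above chain naturally into small, reusable steps. The sketch below walks a handful of made-up sales records through filtering, standardization, enrichment and summarization; the field names, the `REGION_NAMES` lookup and the bucket threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict

REGION_NAMES = {1: "North", 2: "South"}  # hypothetical ID-to-name mapping

raw = [
    {"region_id": 1, "amount": " 120.50 "},
    {"region_id": 2, "amount": "80"},
    {"region_id": 1, "amount": ""},        # incomplete record
    {"region_id": 2, "amount": "19.5"},
]

def cleanse(records):
    # Filtering: drop records with no amount.
    return [r for r in records if r["amount"].strip()]

def standardize(records):
    # Standardization: turn text amounts into numbers.
    return [{**r, "amount": float(r["amount"].strip())} for r in records]

def enrich(records):
    # Enrichment: ID-to-name mapping and bucketing.
    return [
        {
            **r,
            "region": REGION_NAMES.get(r["region_id"], "Unknown"),
            "bucket": "large" if r["amount"] >= 100 else "small",
        }
        for r in records
    ]

def summarize(records):
    # Summarization: total amount per region.
    totals = defaultdict(float)
    for r in records:
        totals[r["region"]] += r["amount"]
    return dict(totals)

enriched = enrich(standardize(cleanse(raw)))
totals = summarize(enriched)
print(totals)
```

Keeping `enriched` around as work-in-progress data makes it possible to re-summarize later without repeating the earlier steps.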


Best Practices
• Keep Real time and historical separate for data > Tera bytes
• Map-Reduce
• Don’t reinvent the wheel
• Build template code /functions for enterprise known use cases
• Keep intermediate data for some time to enable reprocessing
• Summarize only if really required. Keep data in original detail
• Build monitoring capabilities for performance

Transform - Options


Custom Code
• Write custom code in your favorite programming language
• Build-it-yourself. Think before you go there
  • Scalability
  • Reliability
  • Parallelism
• People who built such solutions actually ended up building products that we use today

Advantages
• Custom built to your needs and situations
• Easy integration with custom sources and sinks
• Reuse existing computation code
Use Cases
• Not recommended unless your use case has no readymade solutions
Shortcomings
• Too much to build and maintain
• Long cycle times
• Heavy resource requirements


Hadoop Map-Reduce
• The first big data processing technology
• Code moves to where the data resides
• Mappers work in parallel on individual records and transform
• Reducers summarize and aggregate
• Series of map-reducers work on a pipeline
• Uses cheap hardware with extreme parallelism

Advantages
• Parallelism helps handle huge data processing loads
• Can easily handle text and flexible schema
• Custom code for business functionality
Use Cases
• Batch mode processing
• Text mining
• Data Cleansing & Filtering
• Analyzing media files
Shortcomings
• Not suitable for real time
• Reducers can be choking points
• Need developers with Map-reduce thinking skills
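
The mapper and reducer roles described above are easiest to see in the classic word count. The sketch below runs in a single Python process, so it shows the programming model only and not Hadoop's distribution or its Java API; `mapper`, `shuffle` and `reducer` are illustrative names.

```python
from collections import defaultdict

def mapper(line):
    # Works on one record at a time and emits (key, value) pairs.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Groups values by key; the framework does this between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Summarizes all values that share a key.
    return key, sum(values)

lines = ["big data big ideas", "data moves to code", "code moves to data"]
mapped = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Because each call to `mapper` touches only its own line, the map step can run on many machines at once; a single hot key, however, funnels all its values into one reducer, which is the choking point noted above.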


*QL Query
• Data Products today have some form of SQL support, either native or through another interface
  • Hive, Impala
• Can use *QLs to do
  • Filtering
  • Cleansing
  • Transformation
  • Summarization
  • Insert/update back to source
• The SQL engine does the heavy lifting

Advantages
• Minimum effort, maximum returns
• *QL engines are optimized for performance and parallelism
• Scripts would suffice
Use Cases
• Filtering
• Summarization
• Copying Data
Shortcomings
• Limited capabilities (without programming)
• Combining different data sources / sinks is difficult
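
A single SQL statement can cover filtering, cleansing, summarization and writing the result back, with the engine doing the heavy lifting. The sketch below uses SQLite through Python's built-in `sqlite3` module purely as a stand-in for an engine such as Hive or Impala; the `events` and `view_summary` tables are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (account TEXT, kind TEXT, ms INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("ann", "view", 120),
        ("ann", "view", 80),
        ("bob", "view", 200),
        ("bob", "error", None),   # dirty row: no timing recorded
        ("ann", "buy", 300),
    ],
)
conn.execute(
    "CREATE TABLE view_summary (account TEXT, views INTEGER, avg_ms REAL)"
)

# Filter (WHERE), cleanse (IS NOT NULL), summarize (GROUP BY) and
# insert back into a table, all in one statement.
conn.execute(
    """
    INSERT INTO view_summary
    SELECT account, COUNT(*), AVG(ms)
    FROM events
    WHERE kind = 'view' AND ms IS NOT NULL
    GROUP BY account
    """
)
summary = conn.execute(
    "SELECT account, views, avg_ms FROM view_summary ORDER BY account"
).fetchall()
print(summary)
```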


Apache Spark
• New Generation general data processing engine.
• Eliminates a number of shortcomings of traditional Map-Reduce
• Works on data in memory and in a distributed fashion
• Supports Map-Reduce type operations, but a lot faster.
• Can work on Streaming real time data
• Support for Java, Python, R
• Has SQL and Graph capabilities
• Interactive processing capabilities

Advantages
• Flexible, fast and powerful
• Supports different types of processing
• Can run with Hadoop
Use Cases
• Wide range of use cases
• Interactive processing capabilities
• Stream processing
• De facto standard?
Shortcomings
• Significant coding effort
• Immature
• Significant h/w resource requirements


ETL Products
• Talend, Pentaho, Jaspersoft, Snaplogic etc.
• Commercial and Open Source offerings
• Visual ETL / pipeline builders with almost no coding
• Can build flows from acquisition to transformation to storage
• Custom functions possible
• Operational management available

Advantages
• Easy to build flows
• Good integration with various data sources and sinks
• Management capabilities
Use Cases
• Any use case is supported on paper, devil in the details
• Please try out the product before jumping in
Shortcomings
• Can get complex pretty quickly
• Maturity in question
• Can get pricey for commercial offerings
• Inter-organizational flows can become tricky

Reporting Module

Responsibilities
• Canned Reports
• Do-it-yourself report designer
• Dashboard Designer
• API to extract data from the persistence layer
• Authentication and Authorization
• Real time data presentation
• Alerting


What to architect
• Response times
• Mobile and web access
• Personalization
• Advanced Graphical capabilities
• Threshold management
• Integration with other systems
• Search

Best Practices
• Pick a tool with easy to use graphics capabilities
• Tool should integrate with a variety of data sources
• Aggregate on the fly without compromising on performance
• Use open standards for data access / integrations
• Provide for personalized dashboards
• Design for multiple interfaces – mobile, web, embedded
• Search is a key capability today


Reporting Options

Cloudera Impala
• In-memory distributed query engine for Hadoop
• Interactive Shell for Queries
• Very fast compared to Hive (no Map Reduce)
• Supports Joins, sub-queries, aggregations
• Supports Hadoop and HBase
• Available ODBC Drivers and Thrift APIs


Cloudera Impala
Advantages
• Very fast access for Hadoop data
• Familiar SQL interface through an interactive shell
• Strong integration with Hadoop
Use Cases
• Adhoc querying on Hadoop
• API interface for file management applications
• HBase data access
Shortcomings
• No graphics
• No fault tolerance
• Support for other data stores

Spark SQL
• Provides SQL like programming – easy to use
• Internally implemented as Map-Reduce operations on Spark RDDs.
• Very fast and flexible
• Supports aggregations and joins
• Machine Learning integration with Spark ML
• Can be used for interactive and stream programming


Spark SQL
Advantages
• Rich set of capabilities
• Familiar syntax
• Excellent performance scalability (Spark)
• Multiple languages – Java, Scala, Python
• Easy integration with other libraries through Spark / Scala / Java
Use Cases
• Programmed Querying of large datasets
• Single system for ETL, Analytics and Advanced Analytics
• Real time analytics
Shortcomings
• No graphics
• Programming skills required

3rd Party Tools
• A number of open source and commercial options exist for 3rd Party big data analytics tools
• The choice would narrow down based on cost, familiarity and use case matching
• Options include Tableau, Pentaho, Jasper, QlikView, Birst
• Excellent Graphics capabilities
• Integration through native or ODBC/JDBC drivers
• Visual Design capabilities
• Authentication and authorization integrations


3rd Party tools

Advantages
• Rich Graphics
• Templates for Visualizations and Dashboards
• Multiple Data Sources
• Easy to use Designers
• Authentication and Authorization
• Customization
• Scalability

Use Cases
• Enterprise Dashboards and Reports
• Visual Designers
• Graphics
• Search

Shortcomings
• Cost
• Native support levels differ

Elastic
• ElasticSearch provides an excellent search engine on existing data
• Kibana provides visualization capabilities on ElasticSearch data
• Aggregation capabilities
• Streaming data support


Elastic

Advantages
• Rich Graphics
• Flexible Query capabilities
• Real time analytics
• Search

Use Cases
• Enterprise Dashboards and Reports
• Adhoc Query UIs
• Real time Monitoring

Shortcomings
• Additional work populating ElasticSearch
• Accuracy issues

Advanced Analytics Module

Types of Analytics

Type of Analytics | Description
Descriptive | Understand what happened
Exploratory | Find out why something is happening
Inferential | Understand a population from a sample
Predictive | Forecast what is going to happen
Causal | What happens to one variable when you change another
Deep | Use of advanced techniques to understand large and multi-source datasets

Responsibilities
• Model building capabilities
• Supervised and unsupervised
• Support for Validation techniques
• Ensemble algorithms
• Interactive development
• Automation
• Predictions

What to Architect
• Scalability
• Performance – especially predictions
• Validations – both model and predictions
• Algorithms – options, configurations, tuning
• Automation and operationalization

Best Practices
• Architecture aligned with methodology – work with data scientists
• Plan for adhoc model building – make sure it does not affect other processing
• Not all Advanced Analytics projects yield results
• Set expectations right
• Choose easy projects initially
• Keep an eye on automation and operationalization


Advanced Analytics - Options

R
• Language / Environment for Statistical Computing and Graphics
• Long history of use by Statisticians
• Wide package set for various machine learning algorithms
• Data Cleansing, Transformation and Graphics capabilities
• RStudio – IDE for interactive programming
• Runs on data in memory
• Limited to the memory of the node where it runs


R

Advantages
• Excellent set of machine learning algorithms
• Graphics and other presentation tools
• Interactive model building capabilities
• Mature

Use Cases
• Interactive Model building / Trials on small data sets
• Small data science applications
• Presentations

Shortcomings
• Scaling limited to the local memory
• R cannot be used to build robust application suites
• Big Data capabilities are limited

Python
• Standard Programming language that has Big Data / Data Science related packages and capabilities
• NumPy, SciPy, Pandas, Scikit-Learn
• Vast array of third party libraries
• Data Cleansing and Graphics capabilities
• IDEs available for interactive programming
• Integration with Apache Spark
• Multi purpose language – can be used for other capabilities too


Python

Advantages
• Graphics and Data Cleansing tools
• Interactive model building capabilities
• Integration with Apache Spark
• Easier learning curve compared to R

Use Cases
• Interactive Model building / Trials on small data sets
• Data Cleansing
• Small Advanced Analytics applications

Shortcomings
• Scaling limited to the local memory
• Limited ML implementations compared to R

Apache Spark
• Apache Spark has ML Library – supports a good set of machine learning algorithms
• ML library uses Data Frames from Spark SQL
• Standardized approach – similar usage for all algorithms
• ML algorithms scale horizontally across a cluster
• Can use Java, Scala or Python
• Interactive model building possible – can run on a Windows desktop
• Excellent integration with Big Data Sources
• Real time analytics / predictions possible with Streaming


Apache Spark

Advantages
• Scalability
• Interactive model building capabilities
• Real time predictions
• Support for various data sources and tools
• Support for multiple programming languages

Use Cases
• Predictive modeling for large datasets
• Model building on NoSQL data sources
• Real time predictions

Shortcomings
• No Graphics
• No IDE; only a shell for interactive programming
• Limited algorithms / implementations
• Not mature; fast evolving

Commercial Software
• Tableau, SAS, RapidMiner etc.
• Good set of algorithms
• Good graphics support
• Can Scale
• Can use various Data sources
• Very Pricey.

Use Case 01 : Enterprise Data Backup

Use Case Description
• ABC Enterprise currently keeps 18 months of CRM data in RDBMS and 7 years of archived data on tapes.
• ABC Enterprise wants to move from tape backups to HDFS backups
  • Access of data is easier
  • Can use commodity hardware with potential to move to the cloud
  • No offline backups required
• Provide adhoc querying capability on the data


Characteristics

Characteristic | Type | Notes
Sources | RDBMS |
Data Types | Numeric and relational |
Mode | Historical |
Data Acquisition | Pull |
Availability (RT / HR) | After 1 day | Data needs to be available in the data warehouse after 1 day since the original data is created
Store type | Write once, read many |
Response Times | As good as possible | Given adhoc querying requirements, queries can run for a few seconds.
Model building | none |

Enterprise Data Backup Architecture
[Architecture diagram. Components shown: Source RDBMS, Scheduler, Sqoop, HDFS, Impala]


Solution

Module | Choice | Notes
Acquire | Sqoop | Default choice for Database Extract
Transport | N/A |
Persist | HDFS | Store in native HDFS format as Sequence Files
Transform | N/A |
Reporting | Impala | Basic adhoc querying tool
Advanced Analytics | N/A |

Use Case 02 : Media File Store

Use Case Description
• ABC Enterprise has a contact center where all calls are recorded. These recordings need to be archived for statutory reasons and analytics
• ABC Enterprise wants to move from tape archive to online archive
• Provide adhoc querying capability on the data

Characteristics

Characteristic | Type | Notes
Sources | Contact Center recording solutions |
Data Types | Media files |
Mode | Historical |
Data Acquisition | Pull |
Availability (RT / HR) | After 1 day | Data needs to be available in the media store after 1 day since the original data is created
Store type | Write once, read many |
Response Times | As good as possible | Given adhoc querying requirements, queries can run for a few seconds.
Model building | none |


Media File Store Architecture
[Architecture diagram. Components shown: Media file source, Scheduler, FTP, HDFS, Media Analyzer, MySQL, Media Reports]

Solution

Module | Choice | Notes
Acquire | Files | Only choice for media files
Transport | FTP | Easy transfer; security and compression capable
Persist | HDFS, MySQL | Media files stored in HDFS; Meta-data and analytics stored in MySQL
Transform | Custom | Custom Media Analyzer for tagging media files and storing meta data
Reporting | Impala | Custom Media Reporting tool to analyze meta data and listen to recordings
Advanced Analytics | N/A |

Use Case 03 : Social Media Sentiment Analysis

Use Case Description
• XYZ news corporation tracks popular topics in social media and uses them for their news reporting
• They want an automated system to capture social media interactions on popular topics and do real time sentiment analysis
• Sentiment analysis results need to be summarized and archived for future analysis too.


Characteristics

Characteristic | Type | Notes
Sources | Twitter, Facebook | Social media popular topics. Topics are configurable
Data Types | Tweets, posts (JSON) |
Mode | Real time |
Data Acquisition | Streaming / push |
Availability (RT / HR) | Real time | On the fly analytics
Store type | Write many, read many |
Response Times | Real time | Given adhoc querying requirements, queries can run for a few seconds.
Model building | Sentiment Analysis |

Sentiment Analysis Architecture
[Architecture diagram. Components shown: Apache Kafka, Spark Streaming, Cassandra; other labels recovered: News, Topics, Summary, Configuration]
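
The characteristics above (JSON posts arriving as a stream, sentiment scored on the fly, summaries kept per configurable topic) can be illustrated with a minimal plain-Python sketch. It is not the Kafka / Spark Streaming pipeline itself: the word lists, the `text` field name and the sample posts are assumptions made for illustration, and a real system would use a trained sentiment model.

```python
import json
from collections import defaultdict

# Hypothetical word lists, used only to keep the sketch self-contained.
POSITIVE = {"good", "great", "love", "win"}
NEGATIVE = {"bad", "awful", "hate", "loss"}


def score(text):
    """Return +1, -1 or 0 for a post using a naive word count."""
    words = text.lower().split()
    balance = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return (balance > 0) - (balance < 0)


def summarize(stream, topics):
    """Consume JSON posts one at a time and keep running counts per topic."""
    labels = {1: "positive", -1: "negative", 0: "neutral"}
    summary = defaultdict(lambda: {"positive": 0, "negative": 0, "neutral": 0})
    for raw in stream:
        text = json.loads(raw)["text"]  # assumed field name
        for topic in topics:
            if topic in text.lower():
                summary[topic][labels[score(text)]] += 1
    return dict(summary)


# Made-up sample posts standing in for the social media stream.
posts = [
    '{"text": "Great win for the election team"}',
    '{"text": "I hate the election coverage"}',
    '{"text": "Budget vote today"}',
]
result = summarize(posts, topics=["election", "budget"])
print(result)
```

The per-topic dictionary plays the role of the stored summary; keying it by topic mirrors the idea that topics are configurable and that results are read back per topic.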


Solution

Module | Choice | Notes
Acquire | Streaming | Streaming supported by all social media websites
Transport | Kafka | Kafka provides scalable real time transport for data. Has interfaces to Twitter streaming as well as Spark
Persist | Cassandra | Store data by topic. The social media topic would be used as the key.
Transform | Apache Spark | Real time stream subscription and transformation
Reporting | Custom | Custom application for reading Cassandra data and summarizing for news
Advanced Analytics | Apache Spark | Sentiment Analysis on the fly with stream processing

Use Case 04 : Credit Card Fraud Detection

Use Case Description
• ABC Systems runs a web based retail solution where customers can order any kind of products (like Amazon)
• Sometimes credit card thieves use stolen CC information to make purchases. This later results in loss of revenue
• ABC Systems needs a real time Credit Card Fraud prediction system so that the purchase is blocked before it is complete.

Characteristics

Characteristic | Type | Notes
Sources | Internal / web transactions | Data is captured in real time while payment is being made on the web
Data Types | Numeric / CRM |
Mode | Real time / Historical | Historical data collection; prediction in real time
Data Acquisition | Streaming / push | Data pushed from browser as transactions happen
Availability (RT / HR) | Real time | Real time predictions
Store type | Write once, read many |
Response Times | Minimal | Prediction needs to be made when the purchase is made.
Model building | Binary Classification | Model to predict if a transaction is fraudulent or not.


Credit Card Fraud Detection
[Architecture diagram. Components shown: Custom Web Payment App, Apache Kafka, Mongo DB, Fraud Input, Model Building, Fraud Prediction App, Apache Spark]

Solution

Module | Choice | Notes
Acquire | Web Events | Generated by custom web app. Deployed on a web farm
Transport | Kafka | Kafka provides scalable real time transport for data. Web Transaction events from web app.
Persist | MongoDB | Web events / transactions accumulated and stored in Mongo DB; Also models built are stored in Mongo DB
Transform | Spark |
Reporting | None | Architecture can be enhanced to add adhoc reporting on the web transactions if required.
Advanced Analytics | Apache Spark | Binary Classification model building
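
The split between historical model building and real time prediction can be sketched in plain Python. This is a toy logistic regression, not the Spark implementation; the two features (amount in thousands, and a flag for shipping country differing from billing country), the sample history and the 0.5 threshold are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(rows, labels, epochs=2000, lr=0.1):
    """Fit a logistic regression with plain stochastic gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            p = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
            err = p - y
            weights = [w - lr * err * v for w, v in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def is_fraud(model, x, threshold=0.5):
    """Score one incoming transaction against the stored model."""
    weights, bias = model
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) >= threshold


# Hypothetical history: [amount in thousands, shipping country differs (0/1)]
history = [[0.05, 0], [0.12, 0], [0.30, 0], [0.08, 1],
           [2.50, 1], [3.10, 1], [1.90, 1], [2.20, 0]]
fraud = [0, 0, 0, 0, 1, 1, 1, 1]

model = train(history, fraud)          # offline step on accumulated events
print(is_fraud(model, [2.8, 1]))       # online step at purchase time
print(is_fraud(model, [0.07, 0]))
```

Keeping `train` and `is_fraud` separate reflects the design in the table: the model is built from accumulated transactions and stored, and only the cheap scoring step runs while the payment is in progress.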

Use Case 05 : Operational Analytics

Use Case Description
• ABC Systems runs a cloud based data center with 100s of VM nodes. This data center needs to be kept operational 24 X 7
• To help management, they want to set up a reporting / analytics system that will provide the following
  • Real time node health monitoring
  • Historical root cause analysis for problems
  • Predict node failures
• They need you to architect a big data solution to achieve the same.


Characteristics

Characteristic | Type | Notes
Sources | Server Logs | Server logs generated by VMs and applications
Data Types | Text | Log messages. Might have predefined structures
Mode | Real time |
Data Acquisition | Streaming / push |
Availability (RT / HR) | Real time | Real time monitoring needs real time feeds
Store type | Write many, read many |
Response Times | Real time | Real time network health reporting required
Model building | Classification | Predict node failures

Operational Analytics
[Architecture diagram. Components shown: Web node farm, Log Msgs, Apache Flume, HDFS, Apache Spark, Spark Streaming, Cassandra, Operations Dashboard]
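
As a rough illustration of turning raw log messages into per-node health figures, the sketch below parses lines and counts errors per VM in plain Python. The log layout (`timestamp node LEVEL message`), the node names and the error threshold are assumptions; the deck only says that log messages might have predefined structures.

```python
import re
from collections import Counter

# Assumed log layout: "<timestamp> <node> <LEVEL> <message>"
LINE = re.compile(r"^(\S+) (\S+) (INFO|WARN|ERROR) (.*)$")


def node_health(lines, error_threshold=2):
    """Classify each node seen in the logs by its number of ERROR lines."""
    errors = Counter()
    seen = set()
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue  # skip lines that do not follow the assumed layout
        _, node, level, _ = match.groups()
        seen.add(node)
        if level == "ERROR":
            errors[node] += 1
    return {
        node: "at risk" if errors[node] >= error_threshold else "healthy"
        for node in sorted(seen)
    }


# Made-up log lines standing in for the feed from the VMs.
logs = [
    "2016-05-24T10:00:01 vm-01 INFO heartbeat ok",
    "2016-05-24T10:00:02 vm-02 ERROR disk latency high",
    "2016-05-24T10:00:03 vm-02 ERROR disk read failed",
    "2016-05-24T10:00:04 vm-03 WARN memory at 85%",
    "garbage line",
]
status = node_health(logs)
print(status)
```

A fixed threshold is the simplest stand-in for the classification model named in the table; the per-node dictionary corresponds to the summary data that would be stored by VM.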


Solution

Module | Choice | Notes
Acquire | Log Files | Log files from VMs and applications
Transport | Flume | Flume agents on all VMs help transport log files through to the Big Data layer
Persist | Cassandra, HDFS | Cassandra stores summary / analytics data by each VM (index). HDFS stores raw log files
Transform | Apache Spark | Real time stream subscription and transformation to compute real time monitoring statistics; Historical log analysis for statistics
Reporting | 3rd Party | Use a good 3rd Party reporting solution for real time and historical reporting
Advanced Analytics | Apache Spark | Prediction for node failure. Information stored as part of the node record.

Use Case 06 : News Articles Recommendations

Use Case Description
• ABC news corporation hosts a website where users can read articles about various day-to-day happenings
• For every registered user who logs in, ABC news wants to provide a list of recommended news articles in real time (articles you might like)
• They want to build a web click analysis / prediction system

Characteristics

Characteristic | Type | Notes
Sources | Web Click events | Web clicks generated through Browser logs
Data Types | Text / URLs | URLs that are clicked by the user
Mode | Real time / Historical |
Data Acquisition | Push |
Availability (RT / HR) | Real time | Real time user and item relationships
Store type | Write many, read many | Users, articles and relationships need to be stored.
Response Times | Real time | Real time recommendations / predictions
Model building | Collaborative Filtering, Association Rules | Predict similar articles based on similar users


News Article Recommendations
[Architecture diagram. Components shown: Web Browser, Apache Kafka, HDFS, Apache Spark, Neo4j, Custom Recomm Server]

Solution

Module | Choice | Notes
Acquire | Web Click / Events | Web Clicks from browser collected through a farm of Web Servers
Transport | Kafka | Kafka transports clicks from web servers to the centralized HDFS
Persist | Neo4j, HDFS | HDFS stores raw events; Neo4j stores nodes (users, articles) and relationships (user-user, article-article, user-article)
Transform | Apache Spark |
Reporting | N/A |
Advanced Analytics | Apache Spark | Historical event analysis and generation of collaborative filtering and association rules results
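
One simple form of collaborative filtering, item co-occurrence over click history, can be sketched in plain Python as follows. It stands in for the Spark job and the Neo4j relationship store; the user names, article ids and the choice of co-occurrence counts as the similarity measure are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence(clicks):
    """Count how often two articles were read by the same user."""
    co = defaultdict(lambda: defaultdict(int))
    for articles in clicks.values():
        for a, b in combinations(sorted(articles), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co


def recommend(user, clicks, co, k=2):
    """Rank unread articles by co-occurrence with what the user has read."""
    read = clicks[user]
    scores = defaultdict(int)
    for article in read:
        for other, count in co[article].items():
            if other not in read:
                scores[other] += count
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [article for article, _ in ranked[:k]]


# Hypothetical click history: user -> set of article ids read.
clicks = {
    "ann": {"a1", "a2", "a3"},
    "bob": {"a1", "a2"},
    "cat": {"a2", "a3", "a4"},
    "dan": {"a1"},
}
co = build_cooccurrence(clicks)      # batch step over historical events
print(recommend("dan", clicks, co))  # lookup step when the user logs in
print(recommend("bob", clicks, co))
```

The article-article counts correspond to the article-article relationships listed under Persist: they are computed from history, then read at login time to produce the list quickly.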

Use Case 07 : Customer 360

Use Case Description
• XYZ Corporation produces and sells computers and accessories.
• It wants to track and manage all information about its customers and create a Customer 360.
  • Web site activity
  • Purchase history
  • Issues / Contact Center history
  • Social Media
• It then wants to use this information for prospective selling
  • When customer visits website
  • When customer calls contact center

Characteristics

Characteristic | Type | Notes
Sources | Web site, CRM, Contact Center, social media | Multiple sources of data
Data Types | Numbers, Text, Recordings | All types / kinds of data used
Mode | Historical | Processing of data is historical.
Data Acquisition | Push & Pull | Depending on data
Availability (RT / HR) | Historical | Historical data fetch and data integration
Store type | Write many, read many | Store transaction history as well as customer 360
Response Times | Real time | Real time profile fetches for real time web marketing
Model building | Classification, Clustering, Collaborative Filtering | Predict similar articles based on similar users

Customer 360
[Architecture diagram. Components shown: Browser, Social Media, CRM, Contact Center, Apache Kafka, Scheduler, Sqoop, HDFS, Spark, Cassandra, Mongo DB, Custom Reco. Server]
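
Combining several sources into one profile per customer can be sketched in plain Python as below. The record layouts, field names and sample values are hypothetical; the point is only that each source contributes to a single record keyed by customer id, which can then be fetched quickly.

```python
from collections import defaultdict


def build_profiles(web_clicks, purchases, contacts):
    """Merge records from three sources into one profile per customer."""
    profiles = defaultdict(
        lambda: {"pages": [], "total_spend": 0.0, "open_issues": 0}
    )
    for click in web_clicks:
        profiles[click["customer_id"]]["pages"].append(click["page"])
    for purchase in purchases:
        profiles[purchase["customer_id"]]["total_spend"] += purchase["amount"]
    for ticket in contacts:
        if ticket["status"] == "open":
            profiles[ticket["customer_id"]]["open_issues"] += 1
    return dict(profiles)


# Made-up records standing in for web, CRM and contact center feeds.
web_clicks = [
    {"customer_id": "c1", "page": "/laptops"},
    {"customer_id": "c1", "page": "/accessories"},
    {"customer_id": "c2", "page": "/support"},
]
purchases = [
    {"customer_id": "c1", "amount": 900.0},
    {"customer_id": "c1", "amount": 50.0},
]
contacts = [
    {"customer_id": "c2", "status": "open"},
    {"customer_id": "c2", "status": "closed"},
]

profiles = build_profiles(web_clicks, purchases, contacts)
print(profiles["c1"])
print(profiles["c2"])
```

A lookup such as `profiles["c1"]` is the in-memory analogue of the real time profile fetch; the merge itself is the historical step.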


Solution

Module | Choice | Notes
Acquire | Web Clicks, Transactions, SM Data | Multiple sources; different types
Transport | Kafka and Sqoop | Kafka for event streaming; Sqoop for fetching from RDBMS
Persist | HDFS, Mongo DB, Cassandra | HDFS stores raw events; Mongo DB for transactions; Cassandra for Customer Profiles
Transform | Apache Spark | Ingest from multiple sources and create profile
Reporting | 3rd Party | Data in Mongo DB and Cassandra can be used for Customer Analytics UIs / Dashboards
Advanced Analytics | Apache Spark | Generate Customer classification based on multi-source data

Use Case 08 : IoT - The Connected Car

Use Case Description
• PQR Car company wants to connect cars in real time to an analytics engine
• Cars have multiple sensors. Sensor data needs to be analyzed (real time / historical) to generate alarms for possible failures to the driver
• PQR needs a satellite enabled data collection and alarm system backed by a big data infrastructure

Characteristics

Characteristic | Type | Notes
Sources | Car sensors | Sensors in car
Data Types | Numbers | Numeric event sensor data
Mode | Historical / Real time | Critical data processed real time. Rest historical
Data Acquisition | Push | Sensors send data to collection centers
Availability (RT / HR) | Real time | Real time alarms needed
Store type | Write many, read many | Car profile needs to be stored
Response Times | Real time | Real time profile fetches for real time alarming
Model building | Car issue prediction | Predict possible future issues


IoT : The connected car
[Architecture diagram. Components shown: Car Sensors, Events Collectors, Data Collection nodes, Apache Flume, HDFS, Apache Spark, Spark Streaming, Cassandra, Alarms Server, Car Alarming System]

Solution

Module | Choice | Notes
Acquire | Events from Car Sensors | Embedded in every car
Transport | Custom and Kafka | Events sent through satellite / mobile to collection centers; Kafka then transports from the centers to the big data repository
Persist | HDFS, Cassandra | HDFS stores raw events; Cassandra stores car profiles
Transform | Apache Spark | Ingest from multiple sources and create car profile
Reporting | Custom | Alarm system inside car
Advanced Analytics | Apache Spark | Generate predictions based on car profiles
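
The flow from sensor events to a car profile to an alarm can be sketched in plain Python. The sensor names, the limits, the three-reading window and the rule of alarming on a rolling average are all assumptions made for illustration; the deck does not specify how predictions are derived from car profiles.

```python
from collections import defaultdict, deque

# Hypothetical limits; real values would come from the manufacturer.
LIMITS = {"engine_temp_c": 110.0, "oil_pressure_psi": 20.0}


class CarProfile:
    """Keeps the most recent readings per sensor for one car."""

    def __init__(self, window=3):
        self.readings = defaultdict(lambda: deque(maxlen=window))

    def update(self, sensor, value):
        self.readings[sensor].append(value)

    def average(self, sensor):
        values = self.readings[sensor]
        return sum(values) / len(values)


def process(events, profiles):
    """Update profiles from (car_id, sensor, value) events and emit alarms."""
    alarms = []
    for car_id, sensor, value in events:
        profile = profiles[car_id]
        profile.update(sensor, value)
        avg = profile.average(sensor)
        if sensor == "engine_temp_c" and avg > LIMITS[sensor]:
            alarms.append((car_id, "engine overheating"))
        elif sensor == "oil_pressure_psi" and avg < LIMITS[sensor]:
            alarms.append((car_id, "low oil pressure"))
    return alarms


profiles = defaultdict(CarProfile)
events = [
    ("car-1", "engine_temp_c", 95.0),
    ("car-1", "engine_temp_c", 115.0),
    ("car-1", "engine_temp_c", 125.0),
    ("car-2", "oil_pressure_psi", 35.0),
]
alarms = process(events, profiles)
print(alarms)
```

Averaging over a short window rather than alarming on a single reading is one way to avoid reacting to a lone sensor spike; it is a design choice of this sketch, not something the slides prescribe.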

Transitioning to Big Data
• Start with new use cases
  • Advanced Analytics
  • Social Media
  • Mobile
• Leave what works alone for now
• Choose simple use cases with quick results to convince others
• Work on Proof-of-concept projects
• Be careful about advanced analytics results
• Build sand boxes and move them gradually to production
• Think scalability always