Architecting Big Data Solutions Handouts PDF

5/24/2016
Architecting Big Data Solutions Big Data Overview

Use Cases and Strategies
Unauthorized copying, distribution and exhibition of this presentation is punishable under law
Copyright @2016 V2 Maestros, All rights reserved.
Traditional Data Solutions

Traditional Data Characteristics
• Numbers
• Generated by applications like Finance, Sales, Payroll
• Well defined schema
• Pre-defined linking
• The data attributes hardly change
• Reside within an enterprise
• Centralized data repository
• Offline backups
Traditional Data Processing
• Small “distances” between source and sink – instantaneous transfers
• UI to Database
• Database to Data Processor to Database
• Database to Reporting
• Data moves to the application code for processing
• Data validation at the source (no incomplete / dirty data)
• RDBMS built for number crunching
• Pre-summarized and computed data
• Reporting is primarily pre-canned

Traditional Solutions Architectures
• Single Centralized Data Store
• 3-Tier architectures
• Presentation Layer (UI)
• Business Layer (Backend)
• Data Layer
• Monolith code, either home grown or bought
• Integration through custom interfaces
• Changes require full life-cycle projects
Traditional Data Challenges

• Cannot handle Text Processing economically
• Cannot handle Incomplete and dirty data
• High costs for storing text data ( H/W, S/W)
• Backups restoration is time-consuming
• High management / licensing costs
• Schema changes take significant time

Big Data Solutions

What is Big Data?

Gartner: Big Data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.
• Variety (Text, Video, Audio, machine data)
• Volume (Tera and Peta Bytes)
• Velocity (not under control)
• Veracity (dirty, incomplete, inaccurate)

What led to Big Data
• Cloud adoption
• Social Media
• Mobile explosion
• Machine generated Data
• Data driven management
What is a Big Data Application?

One or many of the following should be true
• Data in Tera or Peta Bytes
• More than one source / form
• Text or media data
• Huge processing loads
• Real time stream processing
• Advanced Analytics
• Changing user requirements

Characteristics of Big Data Products
• Open Source
• Open Integration technologies / APIs
• High interoperability
• Constantly evolving
• Immature
• Big Deployment footprint
• Relatively cheaper
Current Big Data Trends

Technologies
• Numerous companies / projects focused on Big Data technologies
• Mainly open source
• Cloud focused
• Focus on “one thing” with open interfaces for integrations
• Phenomenal growth in adoption spurred by startup culture
• Number of immature alternatives in each segment

Software Product Organizations
• New domains are driving new product features
• Cloud
• Social Media
• Mobile
• Big Data considered necessary for cost savings
• Flexible ad-hoc analysis capabilities demand flexible schema
• Advanced Analytics being added to reporting solutions

Enterprise IT
• Curious and scared at the same time
• Mandated to look at Social / Cloud / Mobile data
• Competitive pressure to be data driven
• Wait and watch until technology is mature
• Starting off proof-of-concepts
• Moving towards the cloud for cost savings
Big Data Solution

Architecture Introduction
What is a Big Data Solution?
• Acquire and assemble “big data”
• Various formats from diverse sources
• Process and persist in scalable and flexible data stores
• Provide flexible open APIs for querying
• Provide advanced analytics capabilities
• Use Big Data technologies to “knit” the solution rather than building it from the ground up.

Traditional vs Big Data Solutions
Activity | Traditional Applications | Big Data Applications
Data Acquisition | Data entry by end users | Databases (traditional applications), machine logs, social media
Validation | Validated during entry | Done post-acquisition
Cleansing | Not required | Required for web / social media data
Transformation | Summarization | Text-to-numbers, enrichment, summarization
Persistence | Single centralized RDBMS | Distributed, polyglot persistence
Applications | 3-Tier, business layer centered | Data centered, integration oriented
Usage | Reporting, Analytics, Statutory | Analytics, Machine Learning, Predictive & Prescriptive
Historical vs Real Time

Historical | Real Time
Store-and-forward | Streaming
End-of-day or end-of-processing trigger | Event based trigger – as it happens
Completed records | Live records with updates
Full publish / republish | Delta
No loss of data | Possible loss of data
Detailed analytics | Snapshot / intraday analytics
Model building | Prediction

Architecture Template
Modules of a Big Data Solution
• Acquisition: batch & stream, multiple formats
• Transportation: over internet and organizational boundaries
• Persistence: polyglot
• Transformation: cleansing, linking, translating, summarizing
• Reporting: UIs and APIs
• Advanced analytics: Machine Learning, Prescriptive, Actionable
• Management

Big Data Architecture Template
[Diagram: Acquire, Transport, Persist, Transform, Analyze, Report, Manage]
Technology Options

About Technology options
• Only popular options are discussed
• No detailed discussions – only salient features, advantages and shortcomings
• Encouraged to seek other sources for deeper learning of these technologies.

Traditional way – Build from scratch
• Monolithic applications – home grown or bought
• Single programming language – 1000s of LOC written
• Single data store
• High development / maintenance costs

Big Data way – assemble and stitch
• Big data processing has 2 common demands – scalability and reliability
• A number of products / technologies available, especially open source
• They support excellent open integrations
• Acquire most suitable components
• Stitch / integrate them to create a solution
• Minimal custom work
• Very fast to-production times
Challenges with Big Data Technologies

Too many options
• Everyone is coming up with a product
• Each product addresses a narrow specific field
• There is no one-product-fits-all
• Everyone is trying to expand to cover other use cases
• Replacement technologies are invented at a fast pace.

Immature and incomplete
• High change rate
• Field support and services are still primitive
• Need to still address administration and usability
• Shortage of skilled and experienced personnel
• Difficult to predict the future

Not future safe
• Technologies going out of vogue before the first release of the application
• Enterprises like their investments to be safe for at least 10 years
• Companies supporting most technologies are small / startups.
• Market deployment size not significant except for a few
What to expect in the next 5-10 years?
• Few products would grow and become the leaders
• Merging of products
• Fewer more mature options
• Stable features

Making investments future safe
• Look for product and developer support
• Look at cloud options
• Adoption by leading companies / products
• Open APIs and data formats
Acquisition Module

Responsibilities
• Connect / maintain connection to the source
• Execute protocol responsibilities (reconnects, handshakes, error handling)
• Data Format conversion
• Filtering
• Local caching
• Compression
• Encryption
Source Types
• Databases / Data warehouses
• Files
• HTTP/REST
• Data Streams
• Custom applications

What to architect?
• Identifying new data
• Re-acquisition and retransmit
• Data Loss – not missing records
• Buffering at the source
• Security – source provider policies
• Privacy – policies
• Alerting / alarming for issues
Best Practices
• Involve source owners to establish good handshakes
• Identifying new data
• Identifying missing data and retransmission
• Go for reliable Open APIs
• Native APIs/Formats should be standardized as early as possible
• Real-time vs historical – consider separate channels
• Pay attention to security and privacy

Acquire - options

SQL Query
• Traditional way of extracting data from Relational Databases
• Extensive support by various languages / tools / products
• Ability to transform (joins, group by, cube) and filter data
• Indexing takes care of performance without any programmer work.
• Encryption and compression supported

SQL Query
Advantages
• Mature technology
• Very popular and mature technology
• Supports incremental fetches and filtering
Use Cases
• RDBMS sources
• Apache Hive sources
Short comings
• Limited to the database supported by them
• Inter-organizational / web access limited by security concerns
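
The incremental fetch mentioned for SQL sources can be sketched as follows. This is only an illustration: the `sales` table, its columns, and the `fetch_new_rows` helper are hypothetical, and an in-memory SQLite database stands in for the real RDBMS source. The idea is to remember the highest key already acquired (a "watermark") and ask only for rows beyond it.

```python
import sqlite3

def fetch_new_rows(conn, last_seen_id):
    """Return rows with id greater than last_seen_id, plus the new watermark."""
    cur = conn.execute(
        "SELECT id, customer, amount FROM sales WHERE id > ? ORDER BY id",
        (last_seen_id,),
    )
    rows = cur.fetchall()
    new_watermark = rows[-1][0] if rows else last_seen_id
    return rows, new_watermark

# Hypothetical source table, held in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (id, customer, amount) VALUES (?, ?, ?)",
    [(1, "acme", 10.0), (2, "globex", 25.5)],
)

first_batch, watermark = fetch_new_rows(conn, 0)

# New data arrives at the source after the first fetch.
conn.execute("INSERT INTO sales (id, customer, amount) VALUES (3, 'initech', 7.25)")
second_batch, watermark = fetch_new_rows(conn, watermark)
```

The second call returns only the row added after the first fetch, which is the "identifying new data" concern listed under What to architect. A timestamp column could serve as the watermark instead of a numeric key.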
Files Files
• Simple and common way of exchanging / moving data
• All systems/ applications provide file based output • Inter-organizational data
• Data converted to files (CSV, TSV, XML, JSON) for data movement
• Media, text files can be easily stored in files • Work easily with inter-organizational boundaries • Media files
• Common tools / commands utilities can achieve
results
• Secure data encryption/
compression
Short comings
• Slow
• Too many manual steps/ stop points
• Data exposed unless properly encrypted
REST APIs REST APIs

• A web-based API standard for exchanging data and performing CRUD
operations • Standard for Internet Data exchange. • Cloud/ Social media data
• Excellent security and scalability sources
• Decouples consumers from producers
• Easy to integrate • Mobile data sources
• Stateless existence
• Real time meta data
• Uniform interface across sources : GET, POST, PUT, DELETE
• Support advanced security (Oauth2) and encryption Short comings
• Support by most cloud and mobile data sources (Twitter, Facebook, • Redundant information (since stateless)
Salesforce etc.) • Rate limitations by providers
• Does not support real time
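
One common way to live with provider rate limits is to retry with exponential backoff. The sketch below is an assumption-laden illustration, not any provider's client: `RateLimited`, `call_with_backoff`, and `fake_get_posts` are hypothetical names, and the fake request simply fails twice before succeeding so the example runs without network access.

```python
import time

class RateLimited(Exception):
    """Raised by the (hypothetical) client when the provider rejects a call."""

def call_with_backoff(request, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call request() and retry with exponential backoff on RateLimited."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # give up after the last allowed attempt
            sleep(base_delay * (2 ** attempt))

# Stand-in for a real REST call: fails twice, then succeeds.
attempts = {"count": 0}

def fake_get_posts():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise RateLimited()
    return [{"id": 1, "text": "hello"}]

# Record the delays instead of actually sleeping, to keep the example fast.
delays = []
posts = call_with_backoff(fake_get_posts, sleep=delays.append)
```

In a real acquisition module the `request` callable would wrap an HTTP GET, and the delay would usually honor whatever retry hint the provider returns.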
Streaming Streaming
• Real time data subscribe / publish model
• Real time instantaneous data transfer • Real time sentiment
• Client subscribes to a specific topic/ sub-set of data analysis
• Can send only “diffs” – small footprint
• HTTP connection is kept “open” • Real time reporting
• Support by all major cloud providers ( Twitter,
• Server pushes data to client whenever new data is available Facebook, Salesforce) • Real time actions based on
• Uses secure keys and encryption user behavior
Short comings
• Loss of data if connection broken
• Rate limitations by providers
• Might need to be supplemented by historical
data pulls
Transport module

Types
• Store and Forward
• Receive from a source “at rest”
• Move data in units
• Track completion
• Retransmit if required
• Streaming
• Continuous moving data stream
• Throttle at source
• Throttle for sink
• Inflight storage
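
The store-and-forward steps above (move data in units, track completion, retransmit if required) can be sketched as below. Everything here is hypothetical: `store_and_forward` and `flaky_send` are illustrative names, and the "link" is a function that deliberately corrupts one transmission so the retransmit path is exercised.

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def store_and_forward(units, send, max_retries=3):
    """Send each unit, confirm by checksum, retransmit on mismatch.

    Returns a dict mapping unit name to the number of attempts used.
    """
    attempts_used = {}
    for name, data in units:
        expected = checksum(data)
        for attempt in range(1, max_retries + 1):
            received = send(name, data)
            if checksum(received) == expected:
                attempts_used[name] = attempt
                break
        else:
            raise RuntimeError(f"could not deliver {name}")
    return attempts_used

# Hypothetical unreliable link: corrupts the first transmission of "b.csv".
sink = {}
seen = set()

def flaky_send(name, data):
    if name == "b.csv" and name not in seen:
        seen.add(name)
        delivered = data[:-1]  # simulate truncation in transit
    else:
        delivered = data
    sink[name] = delivered
    return delivered

units = [("a.csv", b"1,2,3\n"), ("b.csv", b"4,5,6\n")]
status = store_and_forward(units, flaky_send)
```

The returned status is the "track completion" record: it shows which units needed more than one attempt.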
Responsibilities
• Maintain link with acquisition module
• Translate data to protocol optimal formats
• Move data
• Secure Data
• Maintain link with Persistence module
• Save data in Persistence module (and confirm)
• Track data as it moves
• Re-transport in case of failures
• Reporting

What to architect
• Speed
• Throttling
• Reliability of data (no loss in transport)
• Redundancy
• Scalability
• Status Reporting and alarming
• Compression
• Encryption
Best Practices
• Do not reinvent the wheel
• Piggy back on proven messaging/transfer frameworks and protocols.
• Look for integrations between transport technologies and acquisition, transformation and persistence technologies.
• Be aware of data transport costs
• Use reliable inflight storage
• Consider security measures to prevent data theft.

Transport - Options

File Move / Copy File Move / Copy

• Simplest way of moving large files
• Simple and straight forward to use • Intra-enterprise
• Supported on all operating systems
• No special skills • Media Files
• Inter-operating system transfers would require adapters
• Can be quickly scheduled /automated
Short comings
• Inter- O/S moves require adapters
• WAN moves might have reliability issues
• Management of large files and file sets
difficult
SFTP ( Secure File Transfer Protocol) SFTP

• Network Protocol for File Access and Transfer
• Wide Support across O/S, tools and utilities • Inter-enterprise file sharing
• Uses a secure channel for data protection (SSH)
• Mature and widely accepted • Media file transfers
• Authentication/authorization (SSH)
• Data security across internet / VPN/ WAN • Log files
• Data integrity checks
• Can resume interrupted transfers
• Files carry basic source attributes like timestamps Short comings
• Firewalls need to be enabled for SFTP
• Wide support across O/S, tools and utilities
• Passwords need to be shared between parties
• Slower transfer speeds due to encryption
Apache Sqoop Apache Sqoop

• A Command Line tool for transferring data between relational
databases and Apache Hadoop • Simple and straight forward to use • Hadoop based backups
• Parallelism to speed up transfers and data warehouses
• Entire databases, tables or results of SQL
• Bi-directional • Move data to HBase/Hive
• Support for Avro, Sequence, Parquet or Plain text files
• Analyzed data from
• Hive and HBase Support Hadoop to RDBMS
• Parallelism Short comings
• Incremental transfers • Predominantly Linux
• BLOB support • Open security
• No Transformation support
• No streaming
Apache Flume Apache Flume

• A distributed Service for collecting, aggregating and moving large amounts
of log/ streaming data • Configuration driven • Web log shipping
• Each origin has a source component to get events • Massively scalable • Twitter streaming
• A Channel used to transport data • Customizable through code
• A sink where the event is deposited • Edge server events
processing
• Sources can span a large number of servers
• Support for multiple sources ( files, strings, HTTP POSTs, twitter streams) Short comings
• Support for multiple sink types • No ordering guarantees
• Can add custom sources and sinks through code • Duplicates possible
• Robust, fault tolerant, has throttling, failover and recovery capabilities • No replication
• Inflight data processing through interceptors
Apache Kafka Apache Kafka

• An open source message broker platform for real time data feeds
• Highly scalable Real time messaging system • Real time analytics
• Publish – subscribe architecture.
• Multiple subscribers
• Developed at LinkedIn, written in Scala. • Operational metrics
• Ordering guarantees aggregation
• Topics are published. Multiple subscribers can be there for a topic
• Complex event processing
• Ordering guarantees
• Coding required for the publisher and the subscriber to interface with Short comings
Kafka • Coding required for publishers and subscribers
• Replication and high availability • Support (compared to flume)
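
The publish-subscribe model with ordering guarantees and multiple subscribers can be illustrated with a toy in-memory broker. This is not Kafka's API; `TinyBroker`, `publish`, and `poll` are hypothetical names chosen for the sketch. It keeps one append-only log per topic and a separate read position per subscriber, which is the general idea behind independent subscribers reading the same ordered stream.

```python
from collections import defaultdict

class TinyBroker:
    """In-memory stand-in for a broker: one append-only log per topic."""

    def __init__(self):
        self.logs = defaultdict(list)      # topic -> ordered messages
        self.offsets = defaultdict(int)    # (topic, subscriber) -> next index

    def publish(self, topic, message):
        self.logs[topic].append(message)

    def poll(self, topic, subscriber):
        """Return unread messages for this subscriber, in publish order."""
        start = self.offsets[(topic, subscriber)]
        messages = self.logs[topic][start:]
        self.offsets[(topic, subscriber)] = len(self.logs[topic])
        return messages

broker = TinyBroker()
broker.publish("clicks", "page=/home")
broker.publish("clicks", "page=/cart")

analytics_first = broker.poll("clicks", "analytics")
broker.publish("clicks", "page=/checkout")
analytics_second = broker.poll("clicks", "analytics")
metrics_all = broker.poll("clicks", "metrics")
```

Each subscriber sees every message exactly once and in publish order, regardless of when it starts polling. A real broker adds partitioning, replication, and durable storage on top of this.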
Persistence module

Responsibilities
• Reliable data storage
• ACID (Atomicity, Consistency, Isolation, Durability)
• Schema
• Transactional support
• Data Access
• Response times
• Scalability (multi-cluster, shared-nothing etc.)

What to architect
• Scalability
• Consistency
• Transactions
• Read-intensive vs write-intensive
• Mutable vs immutable data
• Cataloging (meta-data)
• Latency: real time vs historical
• Standard vs adhoc loads
• Flexible Schema

Best Practices
• Horses for Courses / polyglot persistence
• Keep schema / design flexible
• Keep data at lowest granularity
• Summarize only if needed
• Consider real time application needs
• Do away with backups
Persist - Options

RDBMS
• Still has a big role in Big Data architectures
• Stores data in Tables and Columns
• Optimized for numbers
• Excellent Query performance
• Limitations in scalability.
• Schema needs to be predefined
• Few mature options – Oracle, MySQL, PostgreSQL

RDBMS HDFS
• A distributed file system that can span across hundreds of nodes
• Very mature technology • Meta Data
• Multiple copies of the same file eliminates need for backups
• Query Performance • Multi Update cases
• Can run on commodity servers
• 3rd Party / Tool support • Work In Progress Data
• Excellent ACID support • Resilient to node failures
• Summary data
• Open Source Apache project
Short comings • Results
• Allows for parallel execution of Map Reduce tasks
• Scalability in TB/PB
• Rigid Schema
• Cost
• Text storage
HDFS Cassandra
• Wide Column store (Big Table)
• Massively scalable and reliable • Log files
• Open source developed by Facebook
• Allows parallel data processing • Media files (recordings)
• Dynamic Schema
• No backups needed • Online backup for RDBMS
• Very cost effective (compare to SAN) data • Decentralized architecture
• Single index for each table
Short comings • Excellent single-row query performance
• No indexes. Access can be very slow
• Bad range scan performance
• Security concerns
• No aggregation support
• Limited to Java programming
Cassandra MongoDB
• Document Oriented Database (JSON)
• Excellent scalability and performance • Customer 360
• Strong consistency
• Strong security • Monitoring Statistics and
analytics • Expressive Query Language
• Multiple writes with excellent performance
• Location based lookup • Multiple Indexes
• Aggregations
Short comings • Replication and failover
• Transaction support
• Master Slave Model
• Adhoc querying
• No support for Group By or Joins
Mongo DB Neo4j
• Graph oriented database
• Open Source • Store documents • Deals with Relationships – nodes and edges
• Multiple indexes • Write once read many data • ACID Compliant
• Strong and rich SQL Support stores
• • Transaction support
Ease of use • Real time analytics
• Cypher – Graph Query Language
• Possible RDBMS
Short comings • Easy to program complex joins
replacement
• Transaction support
• Fast relationship traversals
• No foreign keys
• No Joins
Neo4j ElasticSearch
• A full text search-engine
• Transaction and ACID support • Master Data Management
• A distributed document store
• Excellent Query support • IT Network modeling
• Every field is indexed and searchable
• Fast traversal of nodes (join equivalents) • Social network modeling
• Maps easily to OO applications • Can scale hundreds of servers for structured and unstructured data
• Identity and Access
Management • Aggregation support
Short comings
• No built in user management
• Aggregation support
• Not suitable for data without a lot of
relationships
ElasticSearch
• Outstanding search capabilities • Not recommended as
• Aggregation support primary data store
• Flexible schema • Adhoc query building and
aggregation
• Real time analytics
Transformation module
Short comings
• No ACID support
• No SQL support
• Data loss risks
Responsibilities What to architect?

• Cleansing • Real time vs historical
• Filtering • Templates
• Unwanted, incomplete • DE normalization
• Standardization • Reprocessing
• Format, content
• Parallelism
• Enrichment • Setup
• ID-names mapping, bucketing, categorization • Code
• Integration • Speed
• Between data sources
• Work-In-Progress Storage
• Summarization / Pivoting
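
The transformation responsibilities listed above (cleansing, standardization, enrichment, summarization) can be sketched as a small chain of functions. The record layout, the `REGION_NAMES` mapping, and the bucket threshold of 100 are all assumptions made up for this illustration.

```python
REGION_NAMES = {"01": "North", "02": "South"}  # hypothetical ID-to-name mapping

def cleanse(records):
    """Drop incomplete records (missing region or amount)."""
    return [r for r in records if r.get("region") and r.get("amount") not in (None, "")]

def standardize(record):
    """Normalize format: trimmed region code, amount as float."""
    return {"region": record["region"].strip(), "amount": float(record["amount"])}

def enrich(record):
    """Add a region name (ID-to-name mapping) and an amount bucket."""
    bucket = "high" if record["amount"] >= 100 else "low"
    name = REGION_NAMES.get(record["region"], "Unknown")
    return {**record, "region_name": name, "bucket": bucket}

def summarize(records):
    """Total amount per region name."""
    totals = {}
    for r in records:
        totals[r["region_name"]] = totals.get(r["region_name"], 0.0) + r["amount"]
    return totals

raw = [
    {"region": " 01", "amount": "120.50"},
    {"region": "02", "amount": "30"},
    {"region": "", "amount": "15"},       # incomplete: no region
    {"region": "01", "amount": None},     # incomplete: no amount
]
clean = [enrich(standardize(r)) for r in cleanse(raw)]
totals = summarize(clean)
```

Keeping each step as its own function matches the "templates" item under What to architect: the same steps can be reused across sources and rerun when reprocessing is needed.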
Best Practices
• Keep Real time and historical separate for data > Tera bytes
• Map-Reduce
• Don’t reinvent the wheel
• Build template code / functions for enterprise known use cases
• Keep intermediate data for some time to enable reprocessing
• Summarize only if really required. Keep data in original detail
• Build monitoring capabilities for performance

Transform - Options

Custom Code
• Write custom code in your favorite programming language
• Build-it-yourself. Think before you go there
• Scalability
• Reliability
• Parallelism
• People who built such actually ended up building products that we use today

Custom Code
Advantages
• Custom built to your needs and situations
• Easy integration with custom sources and sinks
• Reuse existing computation code
Use Cases
• Not recommended unless your use case has no readymade solutions
Short comings
• Too much to build and maintain
• Long cycle times
• Heavy resource requirements
Hadoop Map-Reduce
• The first big data processing technology
• Code moves to where data resides
• Mappers work in parallel on individual records and transform
• Reducers summarize and aggregate
• Series of map-reducers work on a pipeline
• Uses cheap hardware with extreme parallelism

Hadoop Map-Reduce
Advantages
• Parallelism helps handle huge data processing loads
• Can easily handle text and flexible schema
• Custom code for business functionality
Use Cases
• Batch mode processing
• Text mining
• Data Cleansing & Filtering
• Analyzing media files
Short comings
• Not suitable for real time
• Reducers can be choking points
• Need developers with Map-reduce thinking skills
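
The mapper and reducer roles can be illustrated with a word count written in plain Python. This is a single-process sketch of the pattern, not Hadoop code: the `mapper`, `shuffle`, and `reducer` functions are illustrative names, and the grouping step that Hadoop performs between the two phases is done here by hand.

```python
from collections import defaultdict
from functools import reduce

def mapper(line):
    """Emit (word, 1) for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Summarize all values seen for one key."""
    return key, reduce(lambda a, b: a + b, values)

lines = ["big data is big", "data moves fast"]
mapped = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())
```

Because each mapper call looks at one record only, the map phase can be spread across many machines. The reducer for a heavily used key has to see all of that key's values, which is one reason reducers can become choking points.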
*QL Query
• Data Products today have some form of SQL support either native or through other interface
• Hive, Impala
• Can use *QLs to do
• Filtering
• Cleansing
• Transformation
• Summarization
• Insert/update back to source
• The SQL engine does the heavy lifting

*QL Query
Advantages
• Minimum effort, maximum returns
• *QL engines are optimized for performance and parallelism
• Scripts would suffice
Use Cases
• Filtering
• Summarization
• Copying Data
Short comings
• Limited capabilities (without programming)
• Combining different data sources / sinks is difficult
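
A single SQL statement can cover filtering, cleansing, summarization, and writing back, which is the "scripts would suffice" point. The sketch below uses SQLite through Python only so that it runs anywhere; the `events` and `region_summary` tables are hypothetical, and Hive or Impala dialects differ in details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (region TEXT, amount REAL)")
conn.execute("CREATE TABLE region_summary (region TEXT, total REAL, n INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("north", 10.0), ("NORTH ", 5.0), ("south", 7.5), (None, 99.0), ("south", -1.0)],
)

# Filter (drop NULL regions and negative amounts), cleanse (trim and
# lowercase the region), summarize (SUM / COUNT), and insert the result
# into a summary table, all in one statement.
conn.execute(
    """
    INSERT INTO region_summary (region, total, n)
    SELECT LOWER(TRIM(region)), SUM(amount), COUNT(*)
    FROM events
    WHERE region IS NOT NULL AND amount >= 0
    GROUP BY LOWER(TRIM(region))
    """
)
summary = conn.execute(
    "SELECT region, total, n FROM region_summary ORDER BY region"
).fetchall()
```

The query engine decides how to scan, group, and aggregate. No procedural code is needed beyond submitting the statement.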
Apache Spark
• New Generation general data processing engine.
• Eliminates a number of shortcomings of traditional Map-Reduce
• Works on data in memory and in a distributed fashion
• Supports Map-Reduce type operations, but a lot faster.
• Can work on Streaming real time data
• Support for Java, Python, R
• Has SQL and Graph capabilities
• Interactive processing capabilities

Apache Spark
Advantages
• Flexible, fast and powerful
• Supports different types of processing
• Can run with Hadoop
Use Cases
• Wide range of use cases
• Interactive processing capabilities
• Stream processing
• De facto standard ?
Short comings
• Significant coding effort
• Immature
• Significant h/w resource requirements
ETL Products ETL Products

• Talend, Pentaho, Jaspersoft, Snaplogic etc.
• Easy to build flows • Any use case is supported
• Commercial and Open Source offerings on paper, devil in the
• Good integration with various data sources
• Visual ETL / pipeline builders with almost no coding and sinks details
• Can build flows from acquisition to transformation to storage • Management capabilities • Please try out the product
• Custom functions possible before jumping in
• Operational management available Short comings
• Can get complex pretty quickly
• Maturity in question
• Can get pricey for commercial offerings
• Inter-organizational flows can become tricky
Reporting Module

Responsibilities
• Canned Reports
• Do-it-yourself report designer
• Dashboard Designer
• API to extract data from the persistence layer
• Authentication and Authorization
• Real time data presentation
• Alerting

What to architect
• Response times
• Mobile and web access
• Personalization
• Advanced Graphical capabilities
• Threshold management
• Integration with other systems
• Search

Best Practices
• Pick a tool with easy to use graphics capabilities
• Tool should integrate with a variety of data sources
• Aggregate on the fly without compromising on performance
• Use open standards for data access/integrations
• Provide for personalized dashboards
• Design for multiple interfaces – mobile, web, embedded
• Search is a key capability today
Reporting Options

Cloudera Impala
• In-memory distributed query engine for Hadoop
• Interactive Shell for Queries
• Very fast compared to Hive (no Map Reduce)
• Supports Joins, sub-queries, aggregations
• Supports Hadoop and HBase
• Available ODBC Drivers and Thrift APIs

Cloudera Impala Spark SQL

• Provides SQL like programming – easy to use
• Very fast access for Hadoop data • Adhoc querying on Hadoop
• Internally implemented as Map-Reduce operations on Spark RDDs.
• Familiar SQL interface through an interactive • API interface for file
shell management applications • Very fast and flexible
• Strong integration with Hadoop • HBase data access • Supports aggregations and joins
• Machine Learning integration with Spark ML
Short comings • Can be used for interactive and stream programming
• No graphics
• No fault tolerance
• Support for other data stores
Spark SQL 3rd Party Tools

• A number of open source and commercial options exists for 3rd Party
• Rich set of capabilities • Programmed Querying of big data analytics tools
• Familiar syntax large datasets
• The choice would narrow down based on cost, familiarity and use
• Excellent performance scalability (Spark) • Single system for ETL, case matching
• Multiple languages – Java, Scala, Python Analytics and Advanced
Analytics • Options include Tableau, Pentaho, Jasper, QlikView, Birst
• Easy integration with other libraries through
Spark/Scala/ Java
• Real time analytics • Excellent Graphics capabilities
Short comings • Integration through native or ODBC/JDBC drivers
• No graphics • Visual Design capabilities
• Programming skills required • Authentication and authorization integrations
3rd Party tools Elastic

• ElasticSearch provides an excellent search engine on existing data
• Rich Graphics • Enterprise Dashboards and
Reports • Kibana provides visualization capabilities on ElasticSearch data
• Templates for Visualizations and Dashboards
• Multiple Data Sources • Aggregation capabilities
• Easy to use Designers
• Authentication and Authorization • Visual Designers • Streaming data support
• Customization • Graphics
• Scalability
Short comings • Search
• Cost
• Native support levels differ
Elastic
• Rich Graphics • Enterprise Dashboards and
• Flexible Query capabilities Reports
• Real time analytics • Adhoc Query UIs
• Search • Real time Monitoring Advanced Analytics Module
Short comings
• Additional work populating ElasticSearch
• Accuracy issues
Types of Analytics

Type of Analytics | Description
Descriptive | Understand what happened
Exploratory | Find out why something is happening
Inferential | Understand a population from a sample
Predictive | Forecast what is going to happen
Causal | What happens to one variable when you change another
Deep | Use of advanced techniques to understand large and multi-source datasets

Responsibilities
• Model building capabilities
• Supervised and unsupervised
• Support for Validation techniques
• Ensemble algorithms
• Interactive development
• Automation
• Predictions
What to Architect
• Scalability
• Performance – especially predictions
• Validations – both model and predictions
• Algorithms – options, configurations, tuning
• Automation and operationalization

Best Practices
• Architecture aligned with methodology – work with data scientists
• Plan adhoc model building – make sure it does not affect other processing
• Not all Advanced Analytics projects yield results
• Set expectations right
• Choose easy projects initially
• Keep an eye on automation and operationalization
Advanced Analytics - Options

R
• Language / Environment for Statistical Computing and Graphics
• Long history of use by Statisticians
• Wide package set for various machine learning algorithms
• Data Cleansing, Transformation and Graphics capabilities
• RStudio – IDE for interactive programming
• Runs on data in memory
• Limited to the memory of the node where it runs

R Python
• Standard Programming language that has Big Data / Data Science
• Excellent set of machine learning algorithms • Interactive Model building related packages and capabilities
• Graphics and other presentation tools / Trials on small data sets • NumPy, SciPy, Pandas, Scikit-Learn
• Interactive model building capabilities • Small data science • Vast array of third party libraries
• Mature applications
• Data Cleansing and Graphics capabilities
• Presentations
Short comings • IDEs available for interactive programming
• Scaling limited to the local memory • Integration with Apache Spark
• R cannot be used to build robust application • Multi purpose language – can be used for other capabilities too
suites
• Big Data capabilities are limited
Python Apache Spark

• Apache Spark has ML Library – supports a good set of machine
• Graphics and Data Cleansing tools • Interactive Model building learning algorithms
• Interactive model building capabilities / Trials on small data sets
• ML library uses Data Frames from Spark SQL
• Integration with Apache Spark • Data Cleansing
• Standardized approach – similar usage for all algorithms
• Easier learning curve compared to R • Small Advanced Analytics
applications • ML algorithms scale horizontally across a cluster
Short comings • Can use Java, Scala or Python
• Scaling limited to the local memory • Interactive model building possible – can run on a windows desktop
• Limited ML implementations compared to R • Excellent integration with Big Data Sources
• Real time analytics / predictions possible with Streaming
Apache Spark Commercial Software

• Tableau, SAS, RapidMiner etc.
• Scalability • Predictive modeling for
• Interactive model building capabilities large datasets • Good set of algorithms
• Real time predictions • Model building on NoSQL • Good graphics support
• Support for various data sources and tools data sources • Can Scale
• Support for multiple programming languages
• Real time predictions • Can use various Data sources
Short comings • Very Pricey.
• No Graphics
• No IDE; only a shell for interactive
programming
• Limited algorithms /implementations
• Not mature; fast evolving
Use Case 01: Enterprise Data Backup

Use Case Description
• ABC Enterprise currently keeps 18 months of CRM data in RDBMS and 7 years of archived data on tapes.
• ABC Enterprise wants to move from tape backups to HDFS backups
• Access of data is easier
• Can use commodity hardware with potential to move to the cloud
• No offline backups required
• Provide adhoc querying capability on the data
Characteristics

Characteristic | Type | Notes
Sources | RDBMS |
Data Types | Numeric and relational |
Mode | Historical |
Data Acquisition | Pull |
Availability (RT / HR) | After 1 day | Data needs to be available in the data warehouse after 1 day since the original data is created
Store type | Write once, read many |
Response Times | As good as possible | Given adhoc querying requirements, queries can run for a few seconds.
Model building | none |

Enterprise Data Backup Architecture
[Diagram labels: Scheduler; Source RDBMS; Sqoop; HDFS; Impala]
Solution

Module | Choice | Notes
Acquire | Sqoop | Default choice for Database Extract
Transport | N/A |
Persist | HDFS | Store in native HDFS format as Sequence Files
Transform | N/A |
Reporting | Impala | Basic adhoc querying tool
Advanced Analytics | N/A |

Use Case 02: Media File Store
Use Case Description
• ABC Enterprise has a contact center where all calls are recorded. These recordings need to be archived for statutory reasons and analytics
• ABC Enterprise wants to move from tape archive to online archive
• Provide adhoc querying capability on the data

Characteristics

Characteristic | Type | Notes
Sources | Contact Center recording solutions |
Data Types | Media files |
Mode | Historical |
Data Acquisition | Pull |
Availability (RT / HR) | After 1 day | Data needs to be available in the media store after 1 day since the original data is created
Response Times | As good as possible | Given adhoc querying requirements, queries can run for a few seconds.
Model building | none |
Media File Store Architecture
[Diagram. Components: Media file source, FTP, Scheduler, HDFS, Media Analyzer, MySQL, Media Reports]

Solution

Module              Choice       Notes
Acquire             Files        Only choice for media files
Transport           FTP          Easy transfer; security and compression capable
Persist             HDFS, MySQL  Media files stored in HDFS; Meta-data and analytics stored in MySQL
Transform           Custom       Custom Media Analyzer for tagging media files and storing meta data
Reporting           Impala       Custom Media Reporting tool to analyze meta data and listen to recordings
Advanced Analytics  N/A
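
The split between media files in HDFS and meta-data in a relational store can be illustrated with a small tagging table. This sketch uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs anywhere; the table layout, column names, paths and tags are all assumptions.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL meta-data store.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE recording_meta (
           hdfs_path    TEXT PRIMARY KEY,  -- where the media file lives in HDFS
           call_date    TEXT,
           duration_sec INTEGER,
           tag          TEXT               -- label assigned by the media analyzer
       )"""
)


def register_recording(conn, hdfs_path, call_date, duration_sec, tag):
    """Store the meta-data row for one archived recording."""
    conn.execute(
        "INSERT INTO recording_meta VALUES (?, ?, ?, ?)",
        (hdfs_path, call_date, duration_sec, tag),
    )


def find_by_tag(conn, tag):
    """Return HDFS paths of recordings with the given tag, oldest first."""
    rows = conn.execute(
        "SELECT hdfs_path FROM recording_meta WHERE tag = ? ORDER BY call_date",
        (tag,),
    )
    return [row[0] for row in rows]


register_recording(conn, "/media/calls/2016/05/23/c-1002.wav", "2016-05-23", 310, "complaint")
register_recording(conn, "/media/calls/2016/05/22/c-0977.wav", "2016-05-22", 95, "complaint")
register_recording(conn, "/media/calls/2016/05/23/c-1003.wav", "2016-05-23", 42, "billing")

print(find_by_tag(conn, "complaint"))
```

A reporting tool would query the meta-data first and fetch only the matching recordings from HDFS for playback.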

Use Case 03 : Social Media Sentiment Analysis

• XYZ news corporation tracks popular topics in social media and uses them for their news reporting
• They want an automated system to capture social media interactions on popular topics and do real time sentiment analysis
• Sentiment analysis results need to be summarized and archived for future analysis too.

Characteristics

Characteristic          Type                   Notes
Sources                 Twitter, Facebook      Social media popular topics. Topics are configurable
Data Types              Tweets, posts (JSON)
Mode                    Real time
Data Acquisition        Streaming / push
Availability (RT / HR)  Real time              On the fly analytics
Store type              Write many, read many
Response Times          Real time              Given adhoc querying requirements, queries can run for a few seconds.
Model building          Sentiment Analysis

Sentiment Analysis Architecture
[Diagram. Components: Apache Kafka, Spark Streaming, Cassandra, Topics Configuration, News Summary]
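
Summarizing sentiment per configurable topic can be sketched in plain Python. This is not Spark Streaming code: it only shows the per-batch fold that a stream job might perform, and the word lists, post format and topic names are made up for illustration.

```python
from collections import defaultdict

# Toy sentiment lexicon (assumption); a real system would use a trained model.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "awful", "hate"}


def score(text):
    """Positive word count minus negative word count."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def summarize_batch(posts, topics, summary=None):
    """Fold one micro-batch of posts into per-topic running totals."""
    if summary is None:
        summary = defaultdict(lambda: {"count": 0, "score": 0})
    for post in posts:
        text = post["text"].lower()
        for topic in topics:
            if topic in text:  # topic is the key, as in a topic-keyed store
                summary[topic]["count"] += 1
                summary[topic]["score"] += score(text)
    return summary


topics = ["budget", "election"]  # configurable topic list
batch = [
    {"text": "Love the new budget, great news"},
    {"text": "The budget is awful"},
    {"text": "Election results look good"},
]
summary = summarize_batch(batch, topics)
print(dict(summary))
```

Passing the returned `summary` into the next call keeps running totals across batches, which is the shape of data a topic-keyed table would hold.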
Solution

Module              Choice        Notes
Acquire             Streaming     Streaming supported by all social media websites
Transport           Kafka         Kafka provides scalable real time transport for data. Has interfaces to Twitter streaming as well as Spark
Persist             Cassandra     Store data by topic. The social media topic would be used as the key.
Transform           Apache Spark  Real time stream subscription and transformation
Reporting           Custom        Custom application for reading Cassandra data and summarizing for news
Advanced Analytics  Apache Spark  Sentiment Analysis on the fly with stream processing

Use Case 04 : Credit Card Fraud Detection

• ABC Systems runs a web based retail solution where customers can order any kind of products (like Amazon)
• Sometimes credit card thieves use stolen CC information to make purchases. This later results in loss of revenue
• ABC Systems needs a real time Credit Card Fraud prediction system so that the purchase is blocked before it is complete.

Characteristic          Type                         Notes
Sources                 Internal / web transactions  Data is captured in real time while payment is being made on the web
Data Types              Numeric / CRM
Mode                    Real time / Historical       Historical data collection; prediction in real time
Data Acquisition        Streaming / push             Data pushed from browser as transactions happen
Availability (RT / HR)  Real time                    Real time predictions
Response Times          Minimal                      Predictions need to be made when the purchase is made.
Model building          Binary Classification        Model to predict if a transaction is fraudulent or not.
Credit Card Fraud Detection
[Diagram. Components: Custom Web Payment App, Apache Kafka, Mongo DB, Apache Spark, Model Building, Fraud Input, Fraud Prediction App]

Solution

Module              Choice        Notes
Acquire             Web Events    Generated by custom web app. Deployed on a web farm
Transport           Kafka         Kafka provides scalable real time transport for data. Web Transaction events from web app.
Persist             MongoDB       Web events / transactions accumulated and stored in Mongo DB; also models built are stored in Mongo DB
Transform           Spark
Reporting           None          Architecture can be enhanced to add adhoc reporting on the web transactions if required.
Advanced Analytics  Apache Spark  Binary Classification model building
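
The real time half of this design (score each transaction with a previously built binary classifier and block it before completion) can be sketched as follows. The logistic form, feature names, weights and threshold are all invented for illustration; in the architecture above the model would be built offline in Spark and loaded from the model store.

```python
import math

# Hypothetical coefficients, as if loaded from the stored model.
WEIGHTS = {
    "amount_usd": 0.004,
    "new_shipping_address": 1.5,
    "orders_last_hour": 0.8,
}
BIAS = -4.0
BLOCK_THRESHOLD = 0.5  # assumption: block when fraud is more likely than not


def fraud_probability(tx):
    """Logistic score in (0, 1) for one transaction event."""
    z = BIAS + sum(WEIGHTS[name] * tx[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def should_block(tx):
    """Decide, while the payment is in flight, whether to stop the purchase."""
    return fraud_probability(tx) >= BLOCK_THRESHOLD


routine = {"amount_usd": 40, "new_shipping_address": 0, "orders_last_hour": 1}
suspicious = {"amount_usd": 900, "new_shipping_address": 1, "orders_last_hour": 4}

for name, tx in [("routine", routine), ("suspicious", suspicious)]:
    print(name, round(fraud_probability(tx), 3), should_block(tx))
```

Keeping scoring to a handful of arithmetic operations is what makes the "minimal" response time requirement achievable; the expensive model building stays on the historical path.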

Use Case 05 : Operational Analytics

• ABC Systems runs a cloud based data center with 100s of VM nodes. This data center needs to be kept operational 24 X 7
• To help management, they want to set up a reporting / analytics system that will provide the following
  • Real time node health monitoring
  • Historical root cause analysis for problems
  • Predict node failures
• They need you to architect a big data solution to achieve the same.

Characteristics

Characteristic          Type                   Notes
Sources                 Server Logs            Server logs generated by VMs and applications
Data Types              Text                   Log messages. Might have predefined structures
Mode                    Real time
Data Acquisition        Streaming / push
Availability (RT / HR)  Real time              Real time monitoring needs real time feeds
Store type              Write many, read many
Response Times          Real time              Real time network health reporting required
Model building          Classification         Predict node failures

Operational Analytics
[Diagram. Components: Web node farm, Log Msgs, Apache Flume, HDFS, Apache Spark, Spark Streaming, Cassandra, Operations Dashboard]
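
Real time node health monitoring from log messages can be sketched as a per-node error count over a window of lines. The log line layout (`timestamp node level message`), node names and error threshold are assumptions; the slides only say that logs might have predefined structures.

```python
from collections import Counter


def parse_line(line):
    """Split an assumed 'timestamp node level message' log line into fields."""
    timestamp, node, level, message = line.split(" ", 3)
    return {"timestamp": timestamp, "node": node, "level": level, "message": message}


def error_counts(lines):
    """Count ERROR messages per node for one window of log lines."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record["level"] == "ERROR":
            counts[record["node"]] += 1
    return counts


def unhealthy_nodes(lines, max_errors=1):
    """Nodes whose error count in the window exceeds max_errors."""
    return sorted(node for node, n in error_counts(lines).items() if n > max_errors)


window = [
    "2016-05-24T10:00:01 vm-017 ERROR disk latency high",
    "2016-05-24T10:00:02 vm-003 INFO heartbeat ok",
    "2016-05-24T10:00:04 vm-017 ERROR disk latency high",
    "2016-05-24T10:00:05 vm-003 ERROR request timed out",
]
print(unhealthy_nodes(window))
```

The per-node counts are the kind of summary that would be stored by VM, while the raw lines go to long-term storage for root cause analysis.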
Solution

Module              Choice           Notes
Acquire             Log Files        Log files from VMs and applications
Transport           Flume            Flume agents on all VMs help transport log files through to the big data layer
Persist             Cassandra, HDFS  Cassandra stores summary/analytics data by each VM (index). HDFS stores raw log files
Transform           Apache Spark     Real time stream subscription and transformation to compute real time monitoring statistics; Historical log analysis for statistics
Reporting           3rd Party        Use a good 3rd Party reporting solution for real time and historical reporting
Advanced Analytics  Apache Spark     Prediction for node failure. Information stored as part of the node record.

Use Case 06 : News Article Recommendations

• ABC news corporation hosts a website where users can read articles about various day-to-day happenings
• For every registered user who logs in, ABC news wants to provide a list of recommended news articles in real time (articles you might like)
• They want to build a web click analysis / prediction system

Characteristic          Type                                        Notes
Sources                 Web Click events                            Web clicks generated through Browser logs
Data Types              Text / URLs                                 URLs that are clicked by the user
Mode                    Real time / Historical
Data Acquisition        Push
Availability (RT / HR)  Real time                                   Real time user and item relationships
Store type              Write many, read many                       Users, articles and relationships need to be stored.
Response Times          Real time                                   Real time recommendations / predictions
Model building          Collaborative Filtering, Association Rules  Predict similar articles based on similar users
News Article Recommendations
[Diagram. Components: Web Browser, Apache Kafka, HDFS, Apache Spark, Neo4j, Custom Recomm Server]

Solution

Module              Choice              Notes
Acquire             Web Click / Events  Web Clicks from browser collected through a farm of Web Servers
Transport           Kafka               Kafka transports clicks from web servers to the centralized HDFS
Persist             Neo4j, HDFS         HDFS stores raw events; Neo4j stores nodes (users, articles) and relationships (user-user, article-article, user-article)
Transform           Apache Spark
Reporting           N/A
Advanced Analytics  Apache Spark        Historical event analysis and generation of collaborative filtering and association rules results
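
"Predict similar articles based on similar users" can be illustrated with a very small co-occurrence recommender over click events. This is a simplified stand-in for the collaborative filtering that would run in Spark; the user names, article names and scoring rule are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def co_click_counts(clicks):
    """Map each article to {other_article: number of users who clicked both}."""
    by_user = defaultdict(set)
    for user, article in clicks:
        by_user[user].add(article)
    counts = defaultdict(lambda: defaultdict(int))
    for articles in by_user.values():
        for a, b in combinations(sorted(articles), 2):
            counts[a][b] += 1  # article-article relationship
            counts[b][a] += 1
    return counts


def recommend(user, clicks, top_n=2):
    """Rank unseen articles by how often they co-occur with the user's clicks."""
    counts = co_click_counts(clicks)
    seen = {article for u, article in clicks if u == user}
    scores = defaultdict(int)
    for article in seen:
        for other, n in counts[article].items():
            if other not in seen:
                scores[other] += n
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [article for article, _ in ranked[:top_n]]


clicks = [
    ("ann", "budget"), ("ann", "election"),
    ("bob", "budget"), ("bob", "election"), ("bob", "sports"),
    ("cat", "budget"), ("cat", "sports"),
    ("dan", "budget"),
]
print(recommend("dan", clicks))
print(recommend("ann", clicks))
```

The article-article counts correspond to the relationships the Persist row stores in the graph database; the recommendation server would read precomputed scores at login rather than recomputing them per request.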

Use Case 07 : Customer 360

• XYZ Corporation produces and sells computers and accessories.
• It wants to track and manage all information about its customers and create a Customer 360.
  • Web site activity
  • Purchase history
  • Issues / Contact Center history
  • Social Media
• It then wants to use this information for prospective selling
  • When customer visits website
  • When customer calls contact center
Characteristics

Characteristic          Type                                                 Notes
Sources                 Web site, CRM, Contact Center, social media          Multiple sources of data
Data Types              Numbers, Text, Recordings                            All types / kinds of data used
Mode                    Historical                                           Processing of data is historical.
Data Acquisition        Push & Pull                                          Depending on data
Availability (RT / HR)  Historical                                           Historical data fetch and data integration
Store type              Write many, read many                                Store transaction history as well as customer 360
Response Times          Real time                                            Real time profile fetches for real time web marketing
Model building          Classification, Clustering, Collaborative Filtering  Predict similar articles based on similar users

Customer 360
[Diagram. Components: Browser, Social Media, CRM, Contact Center, Apache Kafka, Scheduler, Sqoop, HDFS, Spark, Cassandra, Mongo DB, Custom Reco. Server]
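
Building a Customer 360 comes down to folding records from several sources into one profile per customer. The sketch below does this in plain Python; the three sources, the customer ids and the profile fields are assumptions chosen to mirror the web, purchase and contact center histories listed for this use case.

```python
from collections import defaultdict


def build_profiles(web_clicks, purchases, contact_calls):
    """Merge per-source records into one profile per customer id."""
    profiles = defaultdict(
        lambda: {"pages_viewed": [], "total_spent": 0.0, "open_issues": 0}
    )
    for customer_id, page in web_clicks:            # web site activity
        profiles[customer_id]["pages_viewed"].append(page)
    for customer_id, amount in purchases:           # purchase history
        profiles[customer_id]["total_spent"] += amount
    for customer_id, resolved in contact_calls:     # contact center history
        if not resolved:
            profiles[customer_id]["open_issues"] += 1
    return dict(profiles)


web_clicks = [("c1", "/laptops"), ("c1", "/accessories"), ("c2", "/monitors")]
purchases = [("c1", 899.0), ("c1", 49.5)]
contact_calls = [("c2", False), ("c1", True)]

profiles = build_profiles(web_clicks, purchases, contact_calls)
print(profiles["c1"])
print(profiles["c2"])
```

Because the merge is keyed by customer id, the resulting profile can be fetched with a single lookup when the customer visits the website or calls the contact center.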
Solution

Module              Choice                              Notes
Acquire             Web Clicks, Transactions, SM Data   Multiple sources; different types
Transport           Kafka and Sqoop                     Kafka for event streaming; Sqoop for fetching from RDBMS
Persist             HDFS, Mongo DB, Cassandra           HDFS stores raw events; Mongo DB for transactions; Cassandra for Customer Profiles
Transform           Apache Spark                        Ingest from multiple sources and create profile
Reporting           3rd Party                           Data in Mongo DB and Cassandra can be used for Customer Analytics UIs / Dashboards
Advanced Analytics  Apache Spark                        Generate Customer classification based on multi-source data

Use Case 08 : IoT - The Connected Car

• PQR Car company wants to connect cars in real time to an analytics engine
• Cars have multiple sensors. Sensor data needs to be analyzed (real time / historical) to generate alarms for possible failures to the driver
• PQR needs a satellite enabled data collection and alarm system backed by a big data infrastructure

Characteristic          Type                    Notes
Sources                 Car sensors             Sensors in car
Data Types              Numbers                 Numeric event sensor data
Mode                    Historical / Real time  Critical data processed real time. Rest historical
Data Acquisition        Push                    Sensors send data to collection centers
Availability (RT / HR)  Real time               Real time alarms needed
Store type              Write many, read many   Car profiles need to be stored
Response Times          Real time               Real time profile fetches for real time alarming
Model building          Car issue prediction    Predict possible future issues
IoT : The connected car
[Diagram. Components: Car Sensors, Events, Data Collection nodes, Collectors, Apache Flume, HDFS, Apache Spark, Spark Streaming, Cassandra, Car Alarming System, Alarms Server]

Solution

Module              Choice                   Notes
Acquire             Events from Car Sensors  Embedded in every car
Transport           Custom and Kafka         Events sent through satellite/mobile to collection centers; Kafka then transports from the centers to the big data repository
Persist             HDFS, Cassandra          HDFS stores raw events; Cassandra stores car profiles
Transform           Apache Spark             Ingest from multiple sources and create car profile
Reporting           Custom                   Alarm system inside car
Advanced Analytics  Apache Spark             Generate predictions based on car profiles
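
The real time alarm path for critical sensor data can be sketched as a rolling average check per sensor. The sensor (engine temperature), the limit of 110 and the three-reading window are assumptions; a production system would use the predictions built from car profiles rather than a fixed threshold.

```python
from collections import deque


class SensorMonitor:
    """Raise an alarm when the rolling mean of recent readings exceeds a limit."""

    def __init__(self, limit, window=3):
        self.limit = limit
        self.readings = deque(maxlen=window)  # keeps only the latest readings

    def add(self, value):
        """Record one reading; return True if an alarm should be raised."""
        self.readings.append(value)
        window_full = len(self.readings) == self.readings.maxlen
        return window_full and sum(self.readings) / len(self.readings) > self.limit


engine_temp = SensorMonitor(limit=110.0)  # hypothetical limit
events = [95, 100, 108, 118, 125]         # readings pushed by the car
alarms = [engine_temp.add(value) for value in events]
print(alarms)
```

Averaging over a short window avoids alarming the driver on a single noisy reading while still reacting within a few events.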
Transitioning to Big Data

• Start with new use cases
  • Advanced Analytics
  • Social Media
  • Mobile
• Leave what works alone for now
• Choose simple use cases with quick results to convince others
• Work on Proof-of-concept projects
• Be careful about advanced analytics results
• Build sand boxes and move them gradually to production
• Think scalability always