Unauthorized copying, distribution and exhibition of this presentation is punishable under law
Copyright @2016 V2 Maestros, All rights reserved.
5/24/2016
Current Big Data Trends
Technologies
• Numerous companies / projects focused on Big Data technologies
• Mainly open source
• Cloud focused
• Focus on “one thing” with open interfaces for integrations
• Phenomenal growth in adoption spurred by startup culture
• Number of immature alternatives in each segment
Traditional way – Build from scratch
• Monolithic applications – home grown or bought
• Single programming language – 1000s of LOC written
• Single data store
• High development / maintenance costs
Big Data way – assemble and stitch
• Big data processing has 2 common demands – scalability and reliability
• A number of products / technologies available, especially open source
• They support excellent open integrations
What to expect in the next 5-10 years?
• A few products will grow and become the leaders
• Merging of products
• Fewer, more mature options
• Stable features
Making investments future safe
• Look for product and developer support
• Look at cloud options
• Adoption by leading companies / products
• Open APIs and data formats
Acquisition Module
Responsibilities
• Connect / maintain connection to the source
• Execute protocol responsibilities (reconnects, handshakes, error handling)
• Data format conversion
• Filtering
• Local caching
• Compression
• Encryption
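A few of the acquisition responsibilities (reconnects, filtering, format conversion, compression) can be sketched with the Python standard library. This is only an illustration under stated assumptions: `flaky_connect` is a hypothetical source that refuses the first two connection attempts, JSON lines stand in for the target format, and caching and encryption are left out.

```python
import gzip
import json
import time


def connect_with_retry(connect, max_attempts=5, base_delay=0.01):
    """Call `connect` until it succeeds, backing off exponentially between attempts."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))


def acquire(records, keep):
    """Filter raw records, convert them to JSON lines, and gzip the result."""
    kept = [r for r in records if keep(r)]
    payload = "\n".join(json.dumps(r, sort_keys=True) for r in kept)
    return gzip.compress(payload.encode("utf-8"))


# Hypothetical source: fails twice, then returns two records.
attempts = {"count": 0}


def flaky_connect():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("source not ready")
    return [{"id": 1, "level": "INFO"}, {"id": 2, "level": "ERROR"}]


raw = connect_with_retry(flaky_connect)
blob = acquire(raw, keep=lambda r: r["level"] == "ERROR")

# Downstream side: decompress and parse the JSON lines again.
restored = [json.loads(line) for line in gzip.decompress(blob).decode("utf-8").splitlines()]
```

In a real pipeline the retry policy and the filter would normally come from configuration rather than being hard-coded.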
Acquire - Options
Best Practices
• Involve source owners to establish good handshakes
  • Identifying new data
  • Identifying missing data and retransmission
• Go for reliable Open APIs
• Native APIs / formats should be standardized as early as possible
• Real-time vs historical – consider separate channels
• Pay attention to security and privacy
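One common way to identify new data and missing data is to agree with the source owner on a monotonically increasing sequence number per record. The sketch below assumes such a number exists; the function name and the numbering scheme are illustrative, not something the slides prescribe.

```python
def plan_next_pull(last_seen, received):
    """Return (new_ids, missing_ids) for one batch.

    last_seen: highest sequence number already acknowledged.
    received:  sequence numbers that arrived in the latest batch.
    """
    fresh = sorted(seq for seq in set(received) if seq > last_seen)
    if not fresh:
        return [], []
    fresh_set = set(fresh)
    # Anything between the last acknowledged number and the newest one
    # that did not arrive is a gap to request again from the source.
    missing = [seq for seq in range(last_seen + 1, fresh[-1] + 1) if seq not in fresh_set]
    return fresh, missing


new_ids, missing_ids = plan_next_pull(last_seen=100, received=[101, 102, 105, 99])
```

Here `99` is ignored as already processed, and the gap between 102 and 105 becomes the retransmission request.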
Files
• Simple and common way of exchanging / moving data
• Data converted to files (CSV, TSV, XML, JSON) for data movement
• Work easily with inter-organizational boundaries
• Common tools / command utilities can achieve results
• Secure data encryption / compression
Advantages
• All systems / applications provide file based output
• Media, text files can be easily stored in files
Use Cases
• Inter-organizational data
• Media files
Shortcomings
• Slow
• Too many manual steps / stop points
• Data exposed unless properly encrypted
Streaming
• Real time data subscribe / publish model
• Client subscribes to a specific topic / sub-set of data
• HTTP connection is kept “open”
• Server pushes data to client whenever new data is available
• Uses secure keys and encryption
Advantages
• Real time instantaneous data transfer
• Can send only “diffs” – small footprint
• Support by all major cloud providers (Twitter, Facebook, Salesforce)
Use Cases
• Real time sentiment analysis
• Real time reporting
• Real time actions based on user behavior
Shortcomings
• Loss of data if connection broken
• Rate limitations by providers
• Might need to be supplemented by historical data pulls
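The subscribe / publish model can be illustrated in-process: a client registers interest in a topic and the server pushes each new message to it. The `Broker` class below is a hypothetical stand-in for a real streaming endpoint; it has no network connection, keys, or encryption.

```python
from collections import defaultdict


class Broker:
    """Minimal in-memory publish / subscribe hub."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register a callback that receives every message published on `topic`."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Push `message` to all subscribers of `topic`; return how many were notified."""
        callbacks = self._subscribers[topic]
        for callback in callbacks:
            callback(message)
        return len(callbacks)


broker = Broker()
sports_feed = []
broker.subscribe("sports", sports_feed.append)

notified = broker.publish("sports", {"score": "2-1"})
ignored = broker.publish("weather", {"temp_c": 21})  # nobody subscribed to this topic
```

The client only sees the subset of data it subscribed to, which is what keeps the footprint small.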
Transport Module
Types
• Store and Forward
  • Receive from a source “at rest”
  • Move data in units
  • Track completion
  • Retransmit if required
• Streaming
  • Continuous moving data stream
  • Throttle at source
  • Throttle for sink
  • Inflight storage
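The store-and-forward steps (move data in units, track completion, retransmit if required) can be sketched as a small loop. The `flaky_send` sink is hypothetical and drops the first attempt for one unit so that the retransmission path is exercised.

```python
def store_and_forward(units, send, max_rounds=3):
    """Send each unit, track acknowledgements, and retransmit unacknowledged units.

    units: iterable of (unit_id, payload) pairs held "at rest".
    send:  callable returning True when the sink acknowledges the unit.
    """
    pending = dict(units)
    delivered = []
    for _ in range(max_rounds):
        for unit_id in list(pending):
            if send(unit_id, pending[unit_id]):
                delivered.append(unit_id)
                del pending[unit_id]
        if not pending:
            break
    return delivered, sorted(pending)


# Hypothetical sink that loses unit "b" on its first attempt.
dropped_once = set()
received = {}


def flaky_send(unit_id, payload):
    if unit_id == "b" and unit_id not in dropped_once:
        dropped_once.add(unit_id)
        return False
    received[unit_id] = payload
    return True


delivered, failed = store_and_forward({"a": b"1", "b": b"2", "c": b"3"}.items(), flaky_send)
```

Units stay in the pending store until acknowledged, so a lost transfer costs a retry rather than data.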
Transport - Options
Best Practices
• Do not reinvent the wheel
• Piggyback on proven messaging / transfer frameworks and protocols
• Look for integrations between transport technologies and acquisition, transformation and persistence technologies
• Be aware of data transport costs
• Use reliable inflight storage
• Consider security measures to prevent data theft
Shortcomings
• Inter-O/S moves require adapters
• WAN moves might have reliability issues
• Management of large files and file sets is difficult
Persistence Module
Responsibilities
• Reliable data storage
• ACID (Atomicity, Consistency, Isolation, Durability)
• Schema
• Transactional support
• Data access
• Response times
• Scalability (multi-cluster, shared-nothing etc.)
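Atomicity and consistency are easiest to see in a small transaction. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `accounts` table and the `transfer` helper are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` between accounts; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)
rejected = transfer(conn, "alice", "bob", 500)  # would overdraw alice, so it is rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The second transfer credits bob before the debit fails the `CHECK` constraint, and the rollback undoes that credit as well.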
Persist - Options
RDBMS
• Still has a big role in Big Data architectures
• Stores data in tables and columns
• Optimized for numbers
• Excellent query performance
• Limitations in scalability
• Schema needs to be predefined
• Few mature options – Oracle, MySQL, PostgreSQL
RDBMS
Advantages
• Very mature technology
• Query performance
• 3rd party / tool support
• Excellent ACID support
Use Cases
• Meta data
• Multi update cases
• Work in progress data
• Summary data
• Results
Shortcomings
• Scalability in TB/PB
• Rigid schema
• Cost
• Text storage
HDFS
• A distributed file system that can span across hundreds of nodes
• Multiple copies of the same file eliminate the need for backups
• Can run on commodity servers
• Resilient to node failures
• Open Source Apache project
• Allows for parallel execution of Map Reduce tasks
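The Map Reduce idea that HDFS enables can be shown on a toy scale: each block of a file is mapped independently, the intermediate pairs are grouped by key, and each group is reduced. The sketch below runs in one process with a thread pool standing in for cluster nodes; it does not use Hadoop, and the three text blocks are invented.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_block(block):
    """Map step: emit (word, 1) for every word in one block."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(mapped):
    """Group the emitted values by key across all blocks."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_group(item):
    """Reduce step: sum the counts for one word."""
    key, values = item
    return key, sum(values)


blocks = ["big data big", "data store", "big store"]
with ThreadPoolExecutor() as pool:
    mapped = list(pool.map(map_block, blocks))          # blocks are processed independently
    counts = dict(pool.map(reduce_group, shuffle(mapped).items()))
```

Because map tasks never share state, adding nodes (here, threads) does not change the result.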
HDFS
Advantages
• Massively scalable and reliable
• Allows parallel data processing
• No backups needed
• Very cost effective (compared to SAN)
Use Cases
• Log files
• Media files (recordings)
• Online backup for RDBMS data
Shortcomings
• No indexes. Access can be very slow
• Security concerns
• Limited to Java programming
Cassandra
• Wide column store (Big Table)
• Open source developed by Facebook
• Dynamic schema
• Decentralized architecture
• Single index for each table
• Excellent single-row query performance
• Bad range scan performance
• No aggregation support
Cassandra
Advantages
• Excellent scalability and performance
• Strong security
• Multiple writes with excellent performance
Use Cases
• Customer 360
• Monitoring statistics and analytics
• Location based lookup
Shortcomings
• Transaction support
• Adhoc querying
• No support for Group By or Joins
MongoDB
• Document oriented database (JSON)
• Strong consistency
• Expressive query language
• Multiple indexes
• Aggregations
• Replication and failover
• Master Slave model
MongoDB
Advantages
• Open Source
• Multiple indexes
• Strong and rich query support
• Ease of use
Use Cases
• Store documents
• Write once read many data stores
• Real time analytics
• Possible RDBMS replacement
Shortcomings
• Transaction support
• No foreign keys
• No Joins
Neo4j
• Graph oriented database
• Deals with relationships – nodes and edges
• ACID compliant
• Transaction support
• Cypher – Graph Query Language
• Easy to program complex joins
• Fast relationship traversals
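Relationship traversal over nodes and edges replaces what would be repeated joins in a relational store. The sketch below uses a plain adjacency list and breadth-first search rather than Neo4j or Cypher; the `follows` graph and its names are invented.

```python
from collections import deque


def within_hops(graph, start, max_hops):
    """Return {node: distance} for every node reachable from `start` in at most `max_hops` edges."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return {node: hops for node, hops in distance.items() if node != start}


# Hypothetical "follows" relationships stored as node -> outgoing edges.
follows = {"ann": ["bob", "cat"], "bob": ["dan"], "cat": ["dan"], "dan": ["eve"]}
reach = within_hops(follows, "ann", 2)
```

Each extra hop here is one more loop iteration, whereas in SQL it would typically be one more self-join.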
Neo4j
Advantages
• Transaction and ACID support
• Excellent query support
• Fast traversal of nodes (join equivalents)
• Maps easily to OO applications
Use Cases
• Master Data Management
• IT network modeling
• Social network modeling
• Identity and Access Management
Shortcomings
• No built in user management
• Aggregation support
• Not suitable for data without a lot of relationships
ElasticSearch
• A full text search engine
• A distributed document store
• Every field is indexed and searchable
• Can scale to hundreds of servers for structured and unstructured data
• Aggregation support
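The statement that every field is indexed and searchable rests on an inverted index: a map from each (field, term) pair to the documents containing it. The sketch below is a simplified in-memory version with whitespace tokenization; it is not the ElasticSearch API, and the two documents are invented.

```python
from collections import defaultdict


def build_index(docs):
    """Map every (field, token) pair to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for token in str(text).lower().split():
                index[(field, token)].add(doc_id)
    return index


def search(index, field, query):
    """Return ids of documents whose `field` contains all query terms."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get((field, tokens[0]), set()))
    for token in tokens[1:]:
        result &= index.get((field, token), set())
    return result


docs = {
    1: {"title": "Big Data Architecture", "body": "assemble and stitch"},
    2: {"title": "Data Transport", "body": "store and forward"},
}
index = build_index(docs)
title_hits = search(index, "title", "big data")
body_hits = search(index, "body", "forward")
```

Lookups touch only the posting sets for the query terms, which is why search stays fast as the document count grows.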
ElasticSearch
Advantages
• Outstanding search capabilities
• Aggregation support
• Flexible schema
Use Cases
• Not recommended as primary data store
• Adhoc query building and aggregation
• Real time analytics
Shortcomings
• No ACID support
• No SQL support
• Data loss risks
Transformation Module
Transform - Options
Best Practices
• Keep real time and historical separate for data > terabytes
• Map-Reduce
• Don’t reinvent the wheel
• Build template code / functions for enterprise known use cases
• Keep intermediate data for some time to enable reprocessing
• Summarize only if really required. Keep data in original detail
• Build monitoring capabilities for performance
Reporting Module
Responsibilities
• Canned reports
• Do-it-yourself report designer
• Dashboard designer
• API to extract data from the persistence layer
• Authentication and authorization
• Real time data presentation
• Alerting
Reporting Options
Cloudera Impala
• In-memory distributed query engine for Hadoop
• Interactive shell for queries
• Very fast compared to Hive (no Map Reduce)
• Supports joins, sub-queries, aggregations
• Supports Hadoop and HBase
• Available ODBC drivers and Thrift APIs
Elastic
Advantages
• Rich graphics
• Flexible query capabilities
• Real time analytics
• Search
Use Cases
• Enterprise dashboards and reports
• Adhoc query UIs
• Real time monitoring
Shortcomings
• Additional work populating ElasticSearch
• Accuracy issues
Advanced Analytics Module
Advanced Analytics - Options
R
• Language / environment for statistical computing and graphics
• Long history of use by statisticians
• Wide package set for various machine learning algorithms
• Data cleansing, transformation and graphics capabilities
• RStudio – IDE for interactive programming
• Runs on data in memory
• Limited to the memory of the node where it runs
R
Advantages
• Excellent set of machine learning algorithms
• Graphics and other presentation tools
• Interactive model building capabilities
Use Cases
• Interactive model building / trials on small data sets
• Small data science
• Presentations
Shortcomings
• Scaling limited to the local memory
• R cannot be used to build robust application suites
• Big Data capabilities are limited
Python
• Standard programming language that has Big Data / Data Science related packages and capabilities
• NumPy, SciPy, Pandas, Scikit-Learn
• Vast array of third party libraries
• Mature applications
• Data cleansing and graphics capabilities
• IDEs available for interactive programming
• Integration with Apache Spark
• Multi purpose language – can be used for other capabilities too
Solution
Module | Choice | Notes
Transform | N/A |
• recordings need to be archived for statutory reasons and analytics
• ABC Enterprise wants to move from tape archive to online archive
• Provide adhoc querying capability on the data
Sources | Contact Center recording solutions
Data Types | Media files
Mode | Historical
Data Acquisition | Pull
Availability (RT / HR) | After 1 day | Data needs to be available in the media store 1 day after the original data is created
Store type | Write once, read many
Response Times | As good as possible | Given adhoc querying requirements, queries can run for a few seconds
Model building | None
Acquire | Files | Only choice for media files
Transform | Custom | Custom Media Analyzer for tagging media files and storing meta data
Reporting | Impala | Custom Media Reporting tool to analyze meta data and listen to recordings
Advanced Analytics | N/A |
(Diagram labels: Scheduler, HDFS, MySQL)
Solution
Module | Choice | Notes
Reporting | Custom | Custom application for reading Cassandra data and summarizing for news
Advanced Analytics | Apache Spark | Sentiment analysis on the fly with stream processing
• order any kind of products (like Amazon)
• Sometimes credit card thieves use stolen CC information to make purchases. This later results in loss of revenue
• ABC Systems needs a real time credit card fraud prediction system so that the purchase is blocked before it is complete
Sources | Internal / web transactions | Data is captured in real time while payment is being made on the web
Data Types | Numeric / CRM
Mode | Real time / Historical | Historical data collection; prediction in real time
Data Acquisition | Streaming / push | Data pushed from browser as transactions happen
Availability (RT / HR) | Real time | Real time predictions
Store type | Write once, read many
Response Times | Minimal | Prediction needs to be made when the purchase is made
Model building | Binary Classification | Model to predict if a transaction is fraudulent or not
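A binary classifier of this kind can be illustrated with logistic regression trained by batch gradient descent. Everything specific below is an assumption made for the sketch: the two features (a scaled purchase amount and a new-device flag), the eight synthetic training rows, and the learning rate. A production system would more likely use a library such as Spark MLlib or scikit-learn.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, labels, epochs=2000, lr=0.5):
    """Fit logistic regression weights and bias with batch gradient descent."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(samples, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def is_fraud(model, x, threshold=0.5):
    """Score one transaction and compare the predicted probability to the threshold."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= threshold


# Synthetic rows: (amount scaled to 0..1, 1 if the device is new), label 1 = fraudulent.
samples = [(0.1, 0), (0.2, 0), (0.15, 1), (0.3, 0), (0.9, 1), (0.8, 1), (0.95, 0), (0.85, 1)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
model = train(samples, labels)
```

Training happens offline on historical data; only the cheap `is_fraud` call needs to run while the purchase is in flight.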
Acquire | Web Events | Generated by custom web app. Deployed on a web farm
Transport | Kafka | Kafka provides scalable real time transport for data. Web transaction events from web app.
Persist | MongoDB | Web events / transactions accumulated and stored in Mongo DB; also models built are stored in Mongo DB
(Diagram labels: Custom Web Payment App, Apache Kafka, Mongo DB, Fraud, Input)
Solution
Module | Choice | Notes
• about various day-to-day happenings
• For every registered user who logs in, ABC news wants to provide a list of recommended news articles in real time (articles you might like)
• They want to build a web click analysis / prediction system
Sources | Web Click events | Web clicks generated through browser logs
Data Types | Text / URLs | URLs that are clicked by the user
Mode | Real time / Historical
Data Acquisition | Push
Availability (RT / HR) | Real time | Real time user and item relationships
Store type | Write many, read many | Users, articles and relationships need to be stored
Response Times | Real time | Real time recommendations / predictions
Model building | Collaborative Filtering, Association Rules | Predict similar articles based on similar users
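Predicting articles from similar users can be sketched as user-based collaborative filtering over click sets. The version below scores unseen articles by the Jaccard similarity of the users who clicked them; the similarity measure, the user ids, and the URLs are assumptions chosen for the illustration.

```python
def jaccard(a, b):
    """Overlap between two click sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(clicks, user, top_n=3):
    """Rank articles the user has not clicked by the similarity of users who did."""
    mine = clicks[user]
    scores = {}
    for other, theirs in clicks.items():
        if other == user:
            continue
        similarity = jaccard(mine, theirs)
        for article in theirs - mine:
            scores[article] = scores.get(article, 0.0) + similarity
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [article for article, score in ranked[:top_n] if score > 0]


# Hypothetical click history: user id -> set of article URLs clicked.
clicks = {
    "u1": {"/sports/final", "/tech/ai"},
    "u2": {"/sports/final", "/tech/ai", "/tech/chips"},
    "u3": {"/food/pasta"},
}
suggestions = recommend(clicks, "u1")
```

User `u3` shares no clicks with `u1`, so its article gets a zero score and is not recommended.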
Reporting | N/A |
Solution
Module | Choice | Notes
Acquire | Web Clicks, Transactions, SM Data | Multiple sources; different types
Transport | Kafka and Sqoop |
Transform | Apache Spark | Ingest from multiple sources and create profile
Use Case 08:
• Cars have multiple sensors. Sensor data needs to be analyzed (real time / historical) to generate alarms for possible failures to the driver
• PQR needs a satellite enabled data collection and alarm system backed by a big data infrastructure
Data Types | Numbers | Numeric event sensor data
Mode | Historical / Real time | Critical data processed real time. Rest historical
Data Acquisition | Push | Sensors send data to collection centers
Availability (RT / HR) | Real time | Real time alarms needed
Store type | Write many, read many | Car profile needs to be stored
Response Times | Real time | Real time profile fetches for real time alarming
Model building | Car issue prediction | Predict possible future issues
Advanced Analytics | Apache Spark | Generate predictions based on car profiles