Oracle Data Intelligence Platform
Accelerating Data Science through Machine Learning

Big Data PaaS Team


February 2017

Machine Learning is Powering Next Generation Analytics
From Descriptive To Prescriptive
From: who, what, where and when; curated dashboards; understanding the past
To: how and why; guided and open-ended interactive discovery; predicting the future
"By 2018 more than half of all large organizations
around the world will use advanced analytics to
compete"
-- Gartner, Feb. 2016
"We wouldn't be looking at Oracle for analytics if
you didn't have the predictive piece."
-- Oracle 2000 customer, Nov. 2016

Copyright 2017 Oracle and/or its affiliates. All rights reserved.


Oracle
| Confidential: Highly Restricted
ML Strategy
By 2020, predictive analytics will attract 40% of
enterprises' net new investment in business intelligence
and analytics (Gartner)
By 2018, more than half of large organizations globally will
compete using advanced analytics and algorithms, causing
the disruption of entire industries (Gartner)
Oracle's ML strategy is to invest in a platform that
supports the entire lifecycle of advanced analytics,
targeting the Oracle enterprise business customer and
specifically enabling what Gartner calls the Citizen Data
Scientist

Machine Learning in Oracle Data Intelligence Platform

Data Insights
Find hidden factors that influence outcomes
Data science API sifts through potential models and selects the best
Automatic visualizations pick the right charts for your data

Recommendations
Repair: quickly fix common problems with your data
Enrich: improve data in unanticipated ways
Accelerate: drive UX that makes complex tasks one-click

Predictions
Data science for non-data-scientists
Point-and-click visual UX guiding normal analysts through model development, evaluation, and tuning
Robust Predict API layer
Export model code to notebooks or expose model as a REST endpoint

Machine Learning is More than Just Algorithms

Oracle Data Intelligence Platform

Data Continuum: Ingest, Explore, Prepare, Flow, Model, Serve
Machine Learning
Store, Search, Manage

Product Demos
1. Recommendation Demo
2. Predict Demo

Recommendation Demo: Sales Data Mart Construction
Goal is to build a dataflow pipeline to load an existing data
mart model
Guide the pipeline construction through ML-based
recommendations
First discover and recommend dimensions related to the
cube (join recommendations)
Then find matching tables in the source database to
populate the dimensions (auto-map recommendations)

Prediction Demo: Student Attrition
How a dataset of student performance can be used to build
a model that provides insight into why college
students drop out of school
How features are engineered and selected, both manually
and automatically
Which features contribute to students dropping out
of school, why, and how these contributing factors
are related
An end-to-end environment (DIP) from ingest, through
recommendations and data transformation, to model building

Product Architecture

Predict Architecture

Automap Service Architecture
[Diagram components: Lambda App, UI, System Facade, Automap API, Parser, Build Search Query, Primary Search, Search Index, Statistical Profiles, Feature Extraction, Column Mapping Model, Logistic Regression Automap Model, Final Sort]

Q&A

APPENDIX

1. Spare Intro Slides

Nomenclature
Learning types differ, but all supervised learning starts with a
labeled training dataset.
In this case, species is the label (or target) and sepal and petal
lengths and widths are the possible features.
Can we predict the species of a new Iris flower from the same
measurements?
Supervised vs Unsupervised

Making Predictions
By building a model of our data using supervised
classification techniques, we can measure how effective the
model is at predicting an Iris flower's species
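
As a rough illustration of the supervised case, the sketch below trains a nearest-centroid classifier on a few labeled flower measurements and scores it on held-out rows. The classifier choice, the function names, and the hand-typed rows are assumptions made for illustration; the slides do not say which technique or data the platform uses.

```python
from collections import defaultdict
from math import dist

# (sepal length, sepal width, petal length, petal width), species label.
# A few hand-typed rows for illustration only.
TRAIN = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((4.9, 3.0, 1.4, 0.2), "setosa"),
    ((4.7, 3.2, 1.3, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "versicolor"),
    ((6.9, 3.1, 4.9, 1.5), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
    ((5.8, 2.7, 5.1, 1.9), "virginica"),
    ((7.1, 3.0, 5.9, 2.1), "virginica"),
]
HELD_OUT = [
    ((5.0, 3.6, 1.4, 0.2), "setosa"),
    ((5.5, 2.3, 4.0, 1.3), "versicolor"),
    ((6.5, 3.0, 5.8, 2.2), "virginica"),
]


def fit_centroids(rows):
    """Average the feature vectors of each label (the 'model')."""
    sums = defaultdict(lambda: [0.0] * 4)
    counts = defaultdict(int)
    for features, label in rows:
        counts[label] += 1
        for i, value in enumerate(features):
            sums[label][i] += value
    return {label: tuple(s / counts[label] for s in sums[label]) for label in sums}


def predict(centroids, features):
    """Return the label whose centroid is closest to the new flower."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


def accuracy(centroids, rows):
    """Fraction of labeled rows the model predicts correctly."""
    hits = sum(predict(centroids, features) == label for features, label in rows)
    return hits / len(rows)


centroids = fit_centroids(TRAIN)
print(predict(centroids, (5.0, 3.6, 1.4, 0.2)))
print(accuracy(centroids, HELD_OUT))
```

Because the held-out rows carry labels, accuracy can be measured directly, which is what makes the setting supervised.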

Making Predictions
What if we didn't have the species name?
This type of learning is called unsupervised because
there is no target or label attribute upon which to build and
evaluate a model

versicolor is more challenging to separate
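
A minimal sketch of the unsupervised case, assuming k-means clustering on two measurements (petal length and width). The algorithm choice, the farthest-point initialization, and the hand-typed points are illustrative assumptions, not something the slides specify.

```python
from math import dist

# (petal length, petal width) only; no species column is available.
# Hand-typed points for illustration.
POINTS = [
    (1.4, 0.2), (1.3, 0.2), (1.5, 0.2),
    (4.7, 1.4), (4.5, 1.5), (4.9, 1.5),
    (6.0, 2.5), (5.1, 1.9), (5.9, 2.1),
]


def farthest_point_init(points, k):
    """Deterministic start: first point, then repeatedly the point farthest from all chosen ones."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    return centroids


def kmeans(points, k, iterations=20):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    centroids = farthest_point_init(points, k)
    assignment = []
    for _ in range(iterations):
        new_assignment = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centroids


assignment, centroids = kmeans(POINTS, 3)
print(assignment)
```

With no label to score against, the result can only be judged by how well the groups hang together. On these points the three small-petal flowers form a cluster of their own, while the boundary between the two larger groups is less clear-cut.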

What about Deep Learning?
A new name for an old technique running on MUCH better
hardware
Tons of recent innovation from academia and industry here
Offers breakthroughs in accuracy for many cognitive tasks:
vision, translation, speech recognition
Hard to interpret, tune, and manage, which leaves a billion-dollar
opening for tech companies to shield customers from the
complexity.

2. Spare DIP ML Slides

Data Insights
Find the hidden factors that influence an outcome
What drives sales in North America?
Gives a warm introduction to unfamiliar data: a launching point for
exploration
Data science API sifts through potential models and selects
the best
More advanced than Watson Analytics or BeyondCore (SFDC)
Automatic visualizations pick the right charts for your data

Data Insights

Recommendations
Repair: quickly fix common problems with your data
Correct types, obfuscate sensitive fields, cluster mismatched values
Enrich: improve data in unanticipated ways
Suggest new datasets to blend in (data you may like)
Augment data with ontologies (general, domain, or custom)
Accelerate: drive UX that makes complex tasks one-click
Joining a new dataset
Mapping fields across datasets

Recommendations

Recommendations

Prediction
Data science for non-data-scientists (Gartner's citizen data
scientist)
Closest competitors: AzureML, Dataiku, DataRobot
Point-and-click visual UX guiding normal analysts through
model development, evaluation, and tuning
Predict API handles:
feature extraction & selection
model and hyperparameter search
analysis of model evaluation results
Export model code to notebooks or expose model as a REST
endpoint

Prediction

3. Spare Architecture Slides

High Level Logical Architecture

Connector: Ingest, Metadata discovery
Enrichment: Recommendations, Transform
Explore: Discover, Visualize
Model: Interpret, Host Models, Predict

4. Misc. Spare Slides

Recommendation Engine for Dataflow
The goal is to guide users in building data pipelines using ML-based
recommendations
Increases productivity by anticipating and suggesting the
next logical steps in the pipeline
Currently provides two types of recommendations:
Auto-map
Join (related entities)

Automap Recommendations
During dataflow pipeline design, find the best source or
target match for a particular entity
An entity can be a database table, a file, a stream of data, or
even the result of a series of transformations (pipeline steps)
Finding a match manually among the very large set of
entities in a data lake would be extremely time-consuming
and inefficient
The system recommends matched entities (with a confidence
score) as well as attribute/column-level mappings

ML-Based Approach - Automap
Extract features from metadata and statistically profiled sampled data
Train an ML model (classifier) to determine whether two entities match or
not
Advantages:
Provide accurate recommendations even when metadata is not available or
meaningful
Learn from examples
Capture user behavior/knowledge
Example: source fields matched to target columns by similar histograms and similar statistics
- field_1 - col3
- field_2 - col1
- field_3 - col2
- field_4 - col5
- field_5 - col4
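
The idea of classifying column pairs from profile features can be sketched as follows. The two features (histogram overlap and mean similarity), the profile dictionaries, the training pairs, and all function names are hypothetical. Logistic regression is used here only because it appears in the Automap architecture diagram; the real feature set and training procedure are not described in the slides.

```python
from math import exp


def pair_features(a, b):
    """Turn two column profiles into similarity features in [0, 1]."""
    overlap = sum(min(x, y) for x, y in zip(a["hist"], b["hist"]))
    scale = max(abs(a["mean"]), abs(b["mean"]), 1e-9)
    mean_similarity = 1.0 - min(abs(a["mean"] - b["mean"]) / scale, 1.0)
    return [overlap, mean_similarity]


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def train(examples, epochs=2000, lr=0.5):
    """Batch gradient descent for a two-feature logistic regression."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in examples:
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            grad_w = [g + error * v for g, v in zip(grad_w, x)]
            grad_b += error
        weights = [w - lr * g / len(examples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(examples)
    return weights, bias


def match_probability(model, x):
    """Confidence score that the pair described by features x is a match."""
    weights, bias = model
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


# Hypothetical labeled pairs: (features, 1 = match, 0 = no match).
TRAINING_PAIRS = [
    ([0.90, 0.95], 1), ([0.80, 0.90], 1), ([0.85, 0.80], 1),
    ([0.10, 0.20], 0), ([0.20, 0.10], 0), ([0.30, 0.40], 0),
]
# Hypothetical profiles: normalized 5-bucket histogram plus mean.
FIELD_1 = {"hist": [0.1, 0.2, 0.4, 0.2, 0.1], "mean": 50.0}
COL3 = {"hist": [0.1, 0.25, 0.35, 0.2, 0.1], "mean": 52.0}
COL1 = {"hist": [0.8, 0.2, 0.0, 0.0, 0.0], "mean": 5.0}

model = train(TRAINING_PAIRS)
print(match_probability(model, pair_features(FIELD_1, COL3)))
print(match_probability(model, pair_features(FIELD_1, COL1)))
```

Note that the column names never enter the features, which is why an approach like this can still rank candidates when metadata is missing or meaningless.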

Vs. Classical Approach - Automap
Based on ad-hoc rule-based models
Tries to match metadata strings directly and applies some
heuristic rules
Source - Target (rule)
- carID - TruckID (hypernyms)
- Brand - Make (synonyms)
- Price - Price (equality)
- SoldTo - Sold2 (Soundex)
- CAddress - CustomerAddress (fuzzy match)
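
For comparison, the rule-based approach might look like the sketch below. The rule order, the tiny synonym and hypernym tables, the "id" suffix stripping, and the 0.6 fuzzy threshold are all assumptions made for illustration. Soundex follows the standard American variant, and fuzzy matching uses `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher

# Hypothetical lookup tables; a real system would use a thesaurus or ontology.
SYNONYMS = {frozenset(("brand", "make"))}
HYPERNYMS = {"car": "vehicle", "truck": "vehicle"}
SOUNDEX_CODES = {
    letter: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r",
    }.items()
    for letter in letters
}


def soundex(name):
    """American Soundex: first letter plus up to three consonant codes."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = SOUNDEX_CODES.get(c, "")
        if digit and digit != previous:
            code += digit
        if c not in "hw":  # h and w do not separate repeated codes
            previous = digit
    return (code + "000")[:4]


def stem(name):
    """Strip a trailing 'id' so that 'carID' is looked up as 'car'."""
    lowered = name.lower()
    return lowered[:-2] if lowered.endswith("id") else lowered


def match_rule(source, target, fuzzy_threshold=0.6):
    """Return the name of the first heuristic rule that fires, or None."""
    s, t = source.lower(), target.lower()
    if s == t:
        return "equality"
    if frozenset((s, t)) in SYNONYMS:
        return "synonyms"
    parent = HYPERNYMS.get(stem(source))
    if parent is not None and parent == HYPERNYMS.get(stem(target)):
        return "hypernyms"
    if soundex(s) and soundex(s) == soundex(t):
        return "soundex"
    if SequenceMatcher(None, s, t).ratio() >= fuzzy_threshold:
        return "fuzzy match"
    return None


for pair in [("carID", "TruckID"), ("Brand", "Make"), ("Price", "Price"),
             ("SoldTo", "Sold2"), ("CAddress", "CustomerAddress")]:
    print(pair, match_rule(*pair))
```

Every rule here inspects only the column names, so the approach has nothing to work with when names are absent or uninformative.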

Join Recommendations
Discover and recommend join relationships between entities
Multiple use cases:
Enrich streaming data
Data mart cube building (discover related dimensions)
Forward engineering (join related source tables)
Two complementary techniques
Use foreign key metadata (when available)
ML-based approach using statistical profiling information

Statistical Profiling Infrastructure
Comprehensive statistics collected
Count, min, max, mean, range, std, iqr, skewness, kurtosis
Percentiles, histogram (bucket & value), mode, uniqueness
Missing, invalid
Statistics profiled based on various data types
Numeric: Discrete or Continuous
Textual: Category or Descriptive
Asynchronous execution support
Statistics profiling triggered by ingest of data
Status publishing during the profiling (i.e., started, running, success, failed)
Support for both batch and streaming data
Profiler scheduler for data streams
Auto sample size adjustment
Control error rate (~1%) based on data distribution
Implementation with akka actor model
Support concurrent columns profiling
[Diagram labels: Input RDD; Type (Numeric: Continuous, Discrete; Textual: Categorical, Descriptive); Statistics Profiling Manager for Columns; Streaming Profiler (Data stream, Data Slices, Statistics)]
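
A minimal sketch of what a numeric column profile could contain, using only the Python standard library. The function name, the dictionary keys, the use of population standard deviation, and the moment-based skewness and excess-kurtosis formulas are assumptions; the slides list the statistics but not how they are computed.

```python
import statistics
from collections import Counter


def profile_numeric(values):
    """Compute a small statistical profile for one numeric column.

    None marks a missing value. Assumes at least two non-missing values.
    """
    present = [v for v in values if v is not None]
    n = len(present)
    mean = statistics.fmean(present)
    std = statistics.pstdev(present)  # population standard deviation
    q1, _median, q3 = statistics.quantiles(present, n=4)
    m3 = sum((v - mean) ** 3 for v in present) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in present) / n  # fourth central moment
    return {
        "count": n,
        "missing": len(values) - n,
        "min": min(present),
        "max": max(present),
        "range": max(present) - min(present),
        "mean": mean,
        "std": std,
        "iqr": q3 - q1,
        "skewness": m3 / std ** 3 if std else 0.0,
        "kurtosis": m4 / std ** 4 - 3.0 if std else 0.0,  # excess kurtosis
        "mode": Counter(present).most_common(1)[0][0],
        "uniqueness": len(set(present)) / n,
    }


profile = profile_numeric([1, 2, 2, 3, None, 4, 5, 5, 5])
for name, value in profile.items():
    print(name, value)
```

A batch profiler would run something like this once per column. A streaming profiler would instead need incremental or sampled estimates, which this sketch does not attempt.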
ML-Based Join Recommendations
[Diagram components: Lambda App, UI, System Facade, Related API, Parser, Build Search Query, Primary Search, Search Index, Statistical Profiles, Column Uniqueness Score, Levenshtein Distance Score, Histogram Overlap Score, Mapping Model, Related Mapping Model, Final Sort]
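
The three scores named in the diagram can be sketched as below. How the platform actually computes and combines them is not stated, so the formulas, the unweighted average used for ranking, and the example columns are all hypothetical stand-ins for the trained mapping model.

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming, row by row)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def name_similarity(a, b):
    """Levenshtein distance rescaled to a similarity in [0, 1]."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest


def uniqueness(values):
    """Share of distinct values; 1.0 suggests a candidate key column."""
    return len(set(values)) / len(values) if values else 0.0


def histogram_overlap(a, b):
    """Sum of the smaller relative frequency for each value the columns share."""
    if not a or not b:
        return 0.0
    ha, hb = Counter(a), Counter(b)
    return sum(min(ha[v] / len(a), hb[v] / len(b)) for v in ha.keys() & hb.keys())


def join_score(source_name, source_values, target_name, target_values):
    """Unweighted average of the three scores (a stand-in for a trained model)."""
    return (
        uniqueness(target_values)
        + name_similarity(source_name, target_name)
        + histogram_overlap(source_values, target_values)
    ) / 3.0


# Hypothetical fact-table column and candidate dimension columns.
fact_customer_id = [1, 1, 2, 3, 3, 3]
candidates = {
    "id": [1, 2, 3, 4],
    "amount": [10.5, 20.0, 10.5, 7.25],
}
ranked = sorted(
    candidates,
    key=lambda name: join_score("customer_id", fact_customer_id, name, candidates[name]),
    reverse=True,
)
print(ranked)
```

Sorting the candidates by the combined score corresponds to the Final Sort step in the diagram.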

Tech Stack: Spark, Spark and More Spark
Single tech for data transformation, streaming, and ML
Ease of deployment, management, scaling, and development
Streaming and ML have specialized tech that we continue to
evaluate
Trusted for fast and scalable performance at web scale
Attracting immense investment, which lets us focus on product
engineering

What about Deep Learning?
Deep learning is a branch of the broader discipline of machine learning. It
models a biological system by creating a simulated software network of
mathematical neurons.
AI differs from ML in the type of applications it targets, but the future will see these
merge together. Some example AI uses are:
Machine translation
Robotics
Self-healing systems
Virtual assistants
Complements existing ML by automating feature extraction from data with a high level
of abstraction, e.g. images, video, audio, language.
Tractica forecasts that annual software revenue for enterprise applications of deep
learning will increase from $109 million in 2015 to $10.4 billion in 2024

Machine Learning Changes Business Intelligence
Business intelligence has gone from reports that tell you what happened to
interactive dashboards
From what happened to understanding why it happened
"By 2018 more than half of all large organizations around the world will use
advanced analytics to compete" -- Gartner, Feb. 2016
A few examples of enterprise ML in use today:
Churn prediction
Customer lifetime value
Cross-selling opportunities
Likelihood of buying
Credit scoring
Fraud detection
