
Deep Dive into Data Science with KNIME Analytics Platform
KNIME AG

Copyright © 2019 KNIME AG


Overview
KNIME Analytics Platform

This educational material was produced for the course held at ODSC India 2019 in Bangalore on Aug 10, 2019. Do not copy or distribute.

Do you recognize this?

https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining

Let’s unroll it!

Process steps: Data Preparation, Model Training, Model Optimization, Model Evaluation, Deployment

It always starts with some data

• Data Preparation
– Data Manipulation
– Data Blending
– Missing Values Handling
– Feature Generation
– Dimensionality Reduction
– Feature Selection
– Outlier Removal
– Normalization
– Partitioning
• Model Training
– Model Training
– Bag of Models
– Model Selection
– Ensemble Models
– Own Ensemble Model
– External Models
– Import Existing Models
– Model Factory
– …
• Model Optimization
– Parameter Tuning
– Parameter Optimization
– Regularization
– Model Size
– No. Iterations
– …
• Model Evaluation
– Performance Measures
– Accuracy
– ROC Curve
– Cross-Validation
– …
• Deployment
– Files & DBs
– Dashboards
– REST API
– SQL Code Export
– Reporting
– …

The many Lives of a Dataset

• Data Preparation: Original Data Set with Past Observations
– Partitioning: Training Set, Validation Set, Test Set
• Model Training: Training Set
• Model Optimization: Validation Set
• Model Evaluation: Test Set
• Deployment: New Data from Real World Applications
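
The partitioning step can be sketched in plain Python, outside of KNIME. This is only an illustration: the 60/20/20 split, the fixed seed, and the function name `partition` are assumptions made for the example and are not taken from the course material.

```python
import random

def partition(rows, train_pct=60, valid_pct=20, seed=42):
    """Shuffle rows and split them into training, validation, and test sets."""
    shuffled = list(rows)                    # copy so the input stays untouched
    random.Random(seed).shuffle(shuffled)    # seeded for a reproducible split
    n_train = len(shuffled) * train_pct // 100
    n_valid = len(shuffled) * valid_pct // 100
    train_set = shuffled[:n_train]
    valid_set = shuffled[n_train:n_train + n_valid]
    test_set = shuffled[n_train + n_valid:]  # whatever remains
    return train_set, valid_set, test_set

observations = list(range(100))              # stand-in for past observations
train_set, valid_set, test_set = partition(observations)
print(len(train_set), len(valid_set), len(test_set))
```

The training set is used to fit the model, the validation set to tune it, and the test set is held back for the final evaluation.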

What is KNIME Analytics Platform?

• A tool for data analysis, manipulation, visualization, and reporting
• Based on the graphical programming paradigm
• Provides a diverse array of extensions:
– Text Mining
– Network Mining
– Cheminformatics
– Many integrations, such as Java, R, Python, Weka, Keras, Plotly, H2O, etc.

Visual KNIME Workflows

NODES perform tasks on data
• Each node has inputs, outputs, and a status
• Status: Not Configured, Configured, Executed, Error

Nodes are combined to create WORKFLOWS

Data Access

• Databases
– MySQL, PostgreSQL
– any JDBC (Oracle, DB2, MS SQL Server)
• Files
– CSV, txt
– Excel, Word, PDF
– SAS, SPSS
– XML
– PMML
– Images, texts, networks, chem
• Web, Cloud
– REST, Web services
– Twitter, Google

Big Data

• Spark
• HDFS support
• Hive
• Impala
• In-database processing

Transformation

• Preprocessing
– Row, column, matrix based
• Data blending
– Join, concatenate, append
• Aggregation
– Grouping, pivoting, binning
• Feature Creation and Selection

Analysis & Data Mining

• Regression
– Linear, logistic
• Classification
– Decision tree, ensembles, SVM, MLP, Naïve Bayes
• Clustering
– k-means, DBSCAN, hierarchical
• Validation
– Cross-validation, scoring, ROC
• Deep Learning
– Keras, DL4J
• External
– R, Python, Weka, H2O, Keras

Visualization

• Interactive Visualizations
• JavaScript-based nodes
– Scatter Plot, Box Plot, Line Plot
– Networks, ROC Curve, Decision Tree
– Plotly Integration
– Adding more with each release!
• Misc
– Tag cloud, open street map, molecules
• Script-based visualizations
– R, Python

Deployment

• Database
• Files
– Excel, CSV, txt
– XML
– PMML
– to: local, KNIME Server, SSH server, FTP server
• BIRT Reporting

Over 2000 Native and Embedded Nodes Included:

• Data Access
– MySQL, Oracle, ...
– SAS, SPSS, ...
– Excel, Flat, ...
– Hive, Impala, ...
– XML, JSON, PMML
– Text, Doc, Image, ...
– Web Crawlers
– Industry Specific
– Community / 3rd
• Transformation
– Row
– Column
– Matrix
– Text, Image
– Time Series
– Java
– Python
– Community / 3rd
• Analysis & Mining
– Statistics
– Data Mining
– Machine Learning
– Web Analytics
– Text Mining
– Network Analysis
– Social Media Analysis
– R, Weka, Python
– Community / 3rd
• Visualization
– R
– JFreeChart
– JavaScript
– Plotly
– Community / 3rd
• Deployment
– via BIRT
– PMML
– XML, JSON
– Databases
– Excel, Flat, etc.
– Text, Doc, Image
– Industry Specific
– Community / 3rd

Overview

• Installing KNIME Analytics Platform
• The KNIME Workspace
• The KNIME File Extensions
• The KNIME Workbench
– Workflow editor
– Explorer
– Node Repository
– Description
• Installing new features

Install KNIME Analytics Platform

• Select the KNIME version for your computer:
– Mac
– Windows – 32 or 64 bit
– Linux
• Download the archive and extract the file, or download the installer package and run it

Start KNIME Analytics Platform

• Use the shortcut created by the installer

• Or go to the installation directory and launch KNIME via knime.exe

The KNIME Workspace

• The workspace is the folder/directory in which workflows (and potentially data files) are stored for the current KNIME session.
• Workspaces are portable (just like KNIME)

The KNIME Workbench

KNIME Explorer

• In LOCAL you can access your own workflow projects.
• The Explorer toolbar on the top has a search box and buttons to
– select the workflow displayed in the active editor
– refresh the view
• The KNIME Explorer can contain 4 types of content:
– Workflows
– Workflow groups
– Data files
– Shared Components

Creating New Workflows, Importing and Exporting

• Right-click in KNIME Explorer to create a new workflow or workflow group, or to import a workflow
• Right-click on a workflow or workflow group to export it

Node Repository

• The Node Repository lists all KNIME nodes
• The search box has 2 modes
– Standard Search – exact match of node name
– Fuzzy Search – finds the most similar node name
• Nodes can be added by drag and drop from the Node Repository to the Workflow Editor.

Console and Other Views

• Console view prints out error and warning messages about what is going on under the hood
• Click on View and select Other… to add different views
– Node Monitor, Licenses, etc.

Description

• The Description window gives information about:
– Node Functionality
– Input & Output
– Node Settings
– Ports
– References to literature

Workflow Description

• When selecting the workflow, the Description window gives information about the workflow’s:
– Title
– Description
– Associated Tags and Links
– Creation Date
– Author

Workflow Coach

• Node recommendation engine
– Gives hints about which node to use next in the workflow
– Based on KNIME community usage statistics
– Based on your own KNIME workflows

Tool Bar

The buttons in the toolbar can be used for the active workflow. The most
important buttons:
– Execute selected and executable nodes (F7)
– Execute all executable nodes
– Execute selected nodes and open first view
– Cancel all selected, running nodes (F9)
– Cancel all running nodes

KNIME File Extensions

• Dedicated file extensions for Workflows and Workflow groups associated with KNIME Analytics Platform

• *.knwf for KNIME Workflow Files

• *.knar for KNIME Archive Files

More on Nodes…

A node can have 3 states:

Not Configured:
The node is waiting for configuration or incoming data.

Configured:
The node has been configured correctly, and can be executed.

Executed:
The node has been successfully executed. Results may be
viewed and used in downstream nodes.

Inserting and Connecting Nodes

• Insert nodes into the workflow editor by dragging them from the Node Repository or by double-clicking them in the Node Repository
• Connect nodes by left-clicking the output port of Node A and dragging the cursor to the (matching) input port of Node B
• Common port types: Data, Model, Flow Variable, Image, DB Connection, DB Data

Node Configuration

• Most nodes require configuration
• To access a node configuration window:
– Double-click the node
– Right-click -> Configure

Node Execution

• Right-click node
• Select Execute in the context menu
• If execution is successful, status shows green light
• If execution encounters errors, status shows red light

Node Views

• Right-click node
• Select Views in context menu
• Select output port to inspect execution results

Plot View Data View

Curved Connections!

Getting Started: KNIME Example Server

• Connect via KNIME Explorer to a public repository with a large selection of example workflows for many, many applications
• Workflows also available on KNIME Hub

Sharing Workflows
How to use the KNIME Hub

KNIME Hub

A place to share knowledge about Workflows and Nodes
https://hub.knime.com
The KNIME Hub

Searching Nodes and Workflows

Opening a Workflow from the Hub

Open Workflow in KNIME Analytics Platform

Saving the Workflow

Edit the Workflow

Drag & Drop

Sharing the Workflow
1. Save your Edits

2. Connect to KNIME Hub

Log in to the Hub

Use your KNIME Forum account credentials

Publish your Workflow

1. Edit Metadata

2. Drag & Drop

Open your Workflow in the Hub

Hot Keys (for Future Reference)

• Node Configuration
– F6: opens the configuration window of the selected node
• Node Execution
– F7: executes selected configured nodes
– Shift + F7: executes all configured nodes
– Shift + F10: executes all configured nodes and opens all views
– F9: cancels selected running nodes
– Shift + F9: cancels all running nodes
• Node Connections
– Ctrl + L: connects selected nodes
– Ctrl + Shift + L: disconnects selected nodes
• Move Nodes and Annotations
– Ctrl + Shift + Arrow: moves the selected node in the arrow direction
– Ctrl + Shift + PgUp/PgDown: moves the selected annotation to the front or to the back of all overlapping annotations
• Workflow Operations
– F8: resets selected nodes
– Ctrl + S: saves the workflow
– Ctrl + Shift + S: saves all open workflows
– Ctrl + Shift + W: closes all open workflows
• Metanode
– Shift + F12: opens metanode wizard

Stay connected with KNIME

Blog: knime.com/blog
Forum: forum.knime.com
KNIME Hub: hub.knime.com
KNIME E-Learning Course: www.knime.com/e-learning-course
Follow us on social media

Today’s Example: Next Best Offer (NBO)

• Traditional Direct Marketing advertises a single product to a specific audience. The Next Best Offer (NBO) approach focuses on taking existing customers (and their data) and using upsell models to find interesting new products for them.

• Today we construct a workflow that joins diverse data sources into a set of complete customer records. Using this, we will build and deploy a predictive model to find people who might be interested in a newly available product.

The Data

Today’s Example: Next Best Offer (NBO)

Explore the final workflow


Importing Data
Accessing Files and Databases

Data Source Nodes

Typically characterized by:
− Orange color
− No input ports, 1-2 output ports

Node diagram labels: Output port, Status, Node label

New Node: File Reader

Workhorse of the KNIME Source nodes
• Reads all text-based files (e.g. csv, txt, etc.)
• Many advanced features allow it to read most ‘weird’ files
– Short lines, inline comments, headers and special encoding

YouTube KNIME TV Channel video: https://youtu.be/flaHQw-Qhlg
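
To make "short lines, inline comments, headers" concrete, here is a minimal Python sketch that reads such a file with the standard `csv` module. It is not how the File Reader node is implemented; the sample content, the `#` comment character, and the function name `read_messy_csv` are assumptions chosen for illustration.

```python
import csv
import io

# Hypothetical sample file content; the columns are invented for illustration.
raw = """# exported from the old system
id,product,rating
1,phone,4
2,tablet        # rating missing: a short line
3,laptop,5
"""

def read_messy_csv(text, comment_char="#"):
    """Parse CSV text, dropping comments and padding short lines with None."""
    cleaned = []
    for line in io.StringIO(text):
        line = line.split(comment_char, 1)[0].strip()   # cut inline comments
        if line:                                        # skip now-empty lines
            cleaned.append(line)
    reader = csv.reader(cleaned)
    header = next(reader)                               # first line holds column names
    rows = []
    for fields in reader:
        fields = [f.strip() for f in fields]
        fields += [None] * (len(header) - len(fields))  # pad short lines
        rows.append(dict(zip(header, fields)))
    return rows

for row in read_messy_csv(raw):
    print(row)
```

When reading from disk, a special encoding would be handled by passing `encoding=...` to Python's built-in `open()`.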
File Reader Configuration

Dialog areas: File path, Basic Settings, Advanced Settings, Preview, Help Button

Alternative Faster Way …

Drag & Drop OR Copy & Paste

Filenames and the knime:// Protocol
Absolute URL

Mountpoint-relative URL

Local path

Workflow-Relative File Paths

• Best choice if workflows are to be shared
• Requires matching folder structure within workflow group
− Independent of environment outside of workflow group
• Example: Path to “Sentiment Analysis.table”
− Local path: C:\Users\rb\knime-workspace\KNIMEUserTraining\data\Sentiment Analysis.table
− Workflow relative:

YouTube KNIME TV Channel: https://youtu.be/U9sP4g4yGwY

New Node: Excel Reader (XLS)

• Reads .xls and .xlsx files from Microsoft Excel
• Supports reading from multiple sheets

Excel Reader Configuration
Dialog areas: File path, Sheet-specific settings, Preview
New Node: Table Reader

• Reads tables from the native KNIME Format.
• Maximum performance, minimum configuration
• Dialog areas: File path

YouTube KNIME TV channel video: https://youtu.be/tid1qi2HAOo
Database Connectivity

• Read data from any JDBC-enabled database
• Write your own SQL or model it using dedicated nodes
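
As an illustration of the "write your own SQL" option, the sketch below runs a hand-written query from Python. It uses the standard `sqlite3` module (Python's DB-API, not JDBC) against an in-memory database. The table name `web_activity` appears in the exercise later in this course, but the columns and rows here are assumed for the example.

```python
import sqlite3

# In-memory database standing in for a real one; columns and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE web_activity (customer_key INTEGER, page_views INTEGER)")
conn.executemany(
    "INSERT INTO web_activity VALUES (?, ?)",
    [(1, 12), (2, 3), (3, 7)],
)

# A hand-written query: keep only the more active customers.
query = (
    "SELECT customer_key, page_views FROM web_activity "
    "WHERE page_views > ? ORDER BY customer_key"
)
active = conn.execute(query, (5,)).fetchall()
print(active)
conn.close()
```

Filtering inside the query means only the matching rows leave the database, which is the idea behind in-database processing.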

New Nodes: Database Connectors

• Native: Postgres, MySQL, MS SQL Server, SQLite
• Database Connector (e.g. Oracle, DB2, HANA)
• Big Data: Hive and Impala

Other Useful Data Sources

• PMML Reader – reads standard predictive models
• XML Reader with XPath support
• Python/R Source nodes
• Tika Parser – extracts textual data from 200+ file types
• REST Web Services, and many more

Importing Data Exercise

Start with exercise: Importing Data

Read the following files
• Sentiment Analysis.table
• Sentiment Rating.csv
• Product Data2.xls

Optional: Read table web_activity from the database WebActivity.sqlite
(hint: drag and drop the files from the KNIME Explorer panel to get started)

Today’s Example

Data Manipulation
Clean, Join, Aggregate

Data Manipulation Nodes

• Yellow color with a variety of input and output ports
• Apply a transformation to input data
• Many, many nodes!

New Node: Concatenate

Combine rows from 2 tables with shared columns
• Handles duplicate row keys gracefully
• Take the union or intersection of columns
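
The difference between taking the union and the intersection of columns can be sketched in plain Python. This illustrates the idea only and is not the node's implementation: tables are modeled as lists of dictionaries, row keys are not modeled, and the column names are invented.

```python
def concatenate(top, bottom, mode="union"):
    """Stack two tables (lists of dicts); mode is 'union' or 'intersection'.

    Assumes all rows within one table share the same columns.
    """
    top_cols = list(top[0]) if top else []
    bottom_cols = list(bottom[0]) if bottom else []
    if mode == "union":
        columns = top_cols + [c for c in bottom_cols if c not in top_cols]
    elif mode == "intersection":
        columns = [c for c in top_cols if c in bottom_cols]
    else:
        raise ValueError("mode must be 'union' or 'intersection'")
    # Columns absent from a row become missing values (None).
    return [{c: row.get(c) for c in columns} for row in top + bottom]

old_system = [{"customer": 1, "visits": 4}]
new_system = [{"customer": 2, "visits": 9, "device": "mobile"}]

print(concatenate(old_system, new_system, "union"))
print(concatenate(old_system, new_system, "intersection"))
```

With the union, rows from the old system get a missing value for `device`; with the intersection, the `device` column is dropped.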

New Node: Cell Replacer

Replaces the content of a column based on a lookup
• Top port references the table to be searched
• Bottom port holds the lookup table (search keys and replacement values)
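
A lookup-based replacement can be sketched in Python with a dictionary playing the role of the lookup table. The sentiment labels, the numeric scores, and the choice to turn unmatched values into `None` are assumptions for the example; the node itself offers its own options for unmatched cells.

```python
def replace_cells(table, column, lookup, no_match=None):
    """Return a copy of table with `column` replaced via the lookup dictionary."""
    replaced = []
    for row in table:
        new_row = dict(row)  # copy so the input table stays unchanged
        # Values without a lookup entry get `no_match` (an assumed policy).
        new_row[column] = lookup.get(row[column], no_match)
        replaced.append(new_row)
    return replaced

ratings = [
    {"customer": 1, "sentiment": "good"},
    {"customer": 2, "sentiment": "bad"},
    {"customer": 3, "sentiment": "unknown"},
]
sentiment_to_score = {"bad": 1, "neutral": 2, "good": 3}  # search key -> replacement

print(replace_cells(ratings, "sentiment", sentiment_to_score))
```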

New Node: String Manipulation

Create and edit values in String columns
• Clean up capitalization (e.g. lowercase)
• Replace strings
• Modify existing strings or create new columns
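
For comparison, the same kinds of operations look like this in Python. The sample values and the new column name `Products_clean` are invented; the String Manipulation node has its own function library, which this sketch does not reproduce.

```python
products = [{"Products": "  Smart PHONE "}, {"Products": "Tablet-PC"}]

def clean_products(rows):
    """Add a cleaned copy of the Products column, leaving the original intact."""
    cleaned = []
    for row in rows:
        value = row["Products"].strip().lower()  # fix whitespace and capitalization
        value = value.replace("-", " ")          # replace a substring
        cleaned.append({**row, "Products_clean": value})
    return cleaned

print(clean_products(products))
```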

Data Manipulation Exercise, Activity I

Start with exercise: Data Manipulation, Activity I

• Concatenate web activity data from old and new systems
• Replace sentiment evaluation (strings) with corresponding numeric values
• Use String Manipulation to ensure that all entries of the Products column from the product data spreadsheet are lower case

Joining Columns of Data
Left Table and Right Table, joined by ID
• Inner Join
• Left Outer Join: missing values in the columns from the right table
• Right Outer Join: missing values in the columns from the left table

Joining Columns of Data
Left Table and Right Table, joined by ID
• Full Outer Join: missing values in the columns from the right table and in the columns from the left table
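
The four join modes can be compared side by side with a small Python sketch. It is a simplified illustration rather than the Joiner node's algorithm: it assumes the key is unique within each table and that the two tables share no column names other than the key. The sample names and cities are invented.

```python
def join(left, right, key, mode="inner"):
    """Join two tables (lists of dicts) on `key`.

    mode: 'inner', 'left', 'right', or 'full'. Assumes `key` is unique per table.
    """
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}
    left_cols = [c for c in (left[0] if left else {}) if c != key]
    right_cols = [c for c in (right[0] if right else {}) if c != key]

    if mode == "inner":
        keys = [k for k in left_by_key if k in right_by_key]
    elif mode == "left":
        keys = list(left_by_key)
    elif mode == "right":
        keys = list(right_by_key)
    elif mode == "full":
        keys = list(left_by_key) + [k for k in right_by_key if k not in left_by_key]
    else:
        raise ValueError(f"unknown mode: {mode}")

    result = []
    for k in keys:
        row = {key: k}
        # Unmatched sides are filled with None, i.e. missing values.
        for c in left_cols:
            row[c] = left_by_key.get(k, {}).get(c)
        for c in right_cols:
            row[c] = right_by_key.get(k, {}).get(c)
        result.append(row)
    return result

left = [{"ID": 1, "name": "Ann"}, {"ID": 2, "name": "Bob"}]
right = [{"ID": 2, "city": "Pune"}, {"ID": 3, "city": "Goa"}]

for mode in ("inner", "left", "right", "full"):
    print(mode, join(left, right, "ID", mode))
```

Only ID 2 exists in both tables, so the inner join returns one row, while the outer joins add the unmatched rows with missing values on the other side.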

New Node: Joiner

Combines columns from 2 different tables
• Top port contains “Left” data table
• Bottom port contains “Right” data table

Joiner Configuration – Linking Rows

• Joiner mode
• Values to join on. Multiple joining columns are allowed

Joiner Configuration – Column Selection

• Columns from left table to output table
• Columns from right table to output table

Data Aggregation

Input table:

RowID  Group  Value
r1     m      2
r2     f      3
r3     m      1
r4     f      5
r5     f      7
r6     m      5

Aggregated on “Group” by method: sum(“Value”)

Output table:

RowID     Group  Sum(Value)
r1+r3+r6  m      8
r2+r4+r5  f      15
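
The aggregation on this slide can be reproduced with a few lines of Python. The sketch uses the six rows shown above; the variable names are illustrative.

```python
from collections import defaultdict

# The six rows from the slide: (RowID, Group, Value).
rows = [
    ("r1", "m", 2), ("r2", "f", 3), ("r3", "m", 1),
    ("r4", "f", 5), ("r5", "f", 7), ("r6", "m", 5),
]

sums = defaultdict(int)      # Group -> Sum(Value)
members = defaultdict(list)  # Group -> RowIDs that were merged

for row_id, group, value in rows:
    sums[group] += value
    members[group].append(row_id)

for group in sums:
    print("+".join(members[group]), group, sums[group])
# r1+r3+r6 m 8
# r2+r4+r5 f 15
```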

New Node: GroupBy

Aggregate rows to summarize data
• First tab provides grouping options
• Second tab provides control over aggregation details: aggregation columns and aggregation methods

YouTube KNIME TV video: https://youtu.be/bDwF-TOMtWw
New Node: Duplicate Row Filter

Detects duplicate rows and applies a selected treatment
• First tab provides the option to select columns
• Second tab provides options for treating duplicated values: flag or remove duplicates, and select the criteria for which row to keep
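
The options named above (columns to compare, flag or remove, which row to keep) can be sketched in Python. The function and parameter names are hypothetical, and "first" and "last" are only two of the possible keep criteria. The “Customer Key” column is taken from the exercise in this course; the ages are invented.

```python
def filter_duplicates(rows, key_columns, keep="first", flag_only=False):
    """Remove or flag rows that repeat the same values in `key_columns`."""
    chosen = {}  # key tuple -> index of the row to keep
    for i, row in enumerate(rows):
        k = tuple(row[c] for c in key_columns)
        if k not in chosen or keep == "last":
            chosen[k] = i
    kept = set(chosen.values())
    if flag_only:
        # Keep every row and mark the ones that would have been removed.
        return [{**row, "duplicate": i not in kept} for i, row in enumerate(rows)]
    return [row for i, row in enumerate(rows) if i in kept]

customers = [
    {"Customer Key": 1, "age": 34},
    {"Customer Key": 2, "age": 51},
    {"Customer Key": 1, "age": 35},
]
print(filter_duplicates(customers, ["Customer Key"]))
print(filter_duplicates(customers, ["Customer Key"], keep="last"))
print(filter_duplicates(customers, ["Customer Key"], flag_only=True))
```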

New Node: Column Expression

• Append or modify an arbitrary number of columns using expressions
• Many different functions are available
• No restriction on the number of lines per expression, which allows writing complex expressions
• Part of the KNIME Labs extension

Workflow Organization and Documentation

Comments & Annotations
• Double-click to write
• Use the panel to change properties

YouTube KNIME TV Channel: https://youtu.be/AHURYB_O8sA

Workflow Organization – Good Practices

• Workflow annotations
• Node labels
• Metanodes
− Right click -> Create Metanode...
− Organize workflow by task
− Hide complexity & improve readability

KNIME WorkflowDiff

• Automates identification and comparison of nodes in a workflow, metanodes, and two different workflows
• Identifies insertions, deletions, substitutions, and parameter changes

Data Manipulation Exercise, Activity II

Start with exercise: Data Manipulation, Activity II

• Join all data together using a series of Joiner nodes and the “Customer Key” field
• Resolve duplicates in the joined dataset
• Clean up and document your workflow using annotations, node labels, and metanodes

Today’s Example

Data Visualization
Charts and Tables

Data Visualization

• Large selection of easy-to-use visualization nodes
− Web-based and interactive
− Dedicated nodes, no scripting required
• Plotly nodes
− Similar but integrated from an external library
• R and Python View nodes for highly customizable graphics
− Require scripting

Visualizations using 1 Column

Visualizations using 2 Columns

Visualizations using 3 Columns

New Node: Scatter Plot
• Plots different columns on X and Y
• Displays data including color information
• Produces an interactive view and an image
• Select data points and publish selection to other views

Figure labels: Image outport; Interactivity options

New Node: Scatter Plot

Four configuration tabs

New Node: Color Manager

• Color by nominal or continuous values
• Sync colors between views using the color model port and Color Appender node

Figure labels: Color range for numerical values; Discrete colors for nominal values

New Node: Bar Chart

• Show numerical values across categories
• Vertical or horizontal bars
• Bars can be grouped or stacked

New Node: Line Plot

• Plot a sequence of values, e.g. over time
• Useful to identify trends, also between groups

New Node: Stacked Area Chart

• Visualizes numerical values from multiple columns as stacked areas
• Great for plotting distributions over time

Selection & Filtering in JavaScript Views

Interactivity allows you to select data points in views

− Selection is propagated to other views
− Highlight selected rows or filter them
− Click “Apply” to add a column to the data that indicates selection (true/false) for use in downstream nodes

Figure label: Apply selection

Components – Combined Views

• Multiple JavaScript View nodes can be combined in Components
• Selections are transmitted to all other views
• Also for use on the KNIME WebPortal

Figure labels: Scatter Plot; Table View

Interactivity across Charts: Selection and Filter Events

Subscribing to Selection and Filter


Publishing Selection
Configure Content and Views Layout

• Click the layout button when inside a Component to assign views to rows and columns
• Add views and rows via drag & drop
• Add columns using the + buttons

Data Aggregation

Input table:

Sex  Hair   Age
f    blond  31
m    red    22
f    blond  53
m    brown  16
f    brown  47
f    black  22
m    blond  13
m    red    55

Aggregation: Count

Sex  blond  brown  black  red
f    2      1      1      0
m    1      1      0      2

Aggregation: Mean(Age)

Sex  blond  brown  black  red
f    42     47     22     0
m    13     16     0      38.5

Solution: Pivoting Node


Data Aggregation

Input table:

Gender  Hair   Age
f       blond  31
m       red    22
f       blond  53
m       brown  16
f       brown  47
f       black  22
m       blond  13
m       red    55

Aggregation: Mean(Age)

Gender  blond  brown  black  red
f       42     47     22     0
m       13     16     0      38.5

Pivoting Node: Group - Pivot - Aggregate


New Node: Pivoting

Performs pivoting on selected columns for grouping and pivoting

• Values of the group columns become unique rows
• Values of the pivot columns become unique columns, one for each combination of pivot values and aggregation method
• Many aggregation methods are provided (similar to GroupBy)
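
The group / pivot / aggregate idea can be sketched outside KNIME in a few lines of plain Python. This is only an illustration of the concept, not how the Pivoting node is implemented. The `pivot` helper and its `empty` parameter are made up for this example, and the rows are the toy table from the Data Aggregation slides.

```python
from collections import defaultdict

# Toy table from the Data Aggregation slides: (sex, hair, age)
rows = [
    ("f", "blond", 31), ("m", "red", 22), ("f", "blond", 53), ("m", "brown", 16),
    ("f", "brown", 47), ("f", "black", 22), ("m", "blond", 13), ("m", "red", 55),
]

def pivot(rows, aggregate, empty=None):
    """Group by sex (rows), pivot on hair (columns), aggregate the ages per cell."""
    cells = defaultdict(list)
    for sex, hair, age in rows:
        cells[(sex, hair)].append(age)
    groups = sorted({sex for sex, _, _ in rows})
    pivots = sorted({hair for _, hair, _ in rows})
    # `empty` is the placeholder for group/pivot combinations without rows (assumption)
    return {
        g: {p: aggregate(cells[(g, p)]) if (g, p) in cells else empty for p in pivots}
        for g in groups
    }

def mean(values):
    return sum(values) / len(values)

count_table = pivot(rows, len, empty=0)   # Aggregation: Count
mean_table = pivot(rows, mean)            # Aggregation: Mean(Age)

for group, row in mean_table.items():
    print(group, row)
```

Each distinct group value becomes one output row and each distinct pivot value becomes one output column, which mirrors the Groups ~ Rows, Pivots ~ Columns layout of the node dialog.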

New Node: Pivoting
Groups ~ Rows

Pivots ~ Columns

Aggregation

Script-based View Nodes

• R View nodes for greater customizability
− Use your favorite libraries, e.g. ggplot2
• If you prefer Python: Python View node
• For JS developers: Generic JavaScript View

Visualization Exercise
Start with exercise: Visualization
• Read sales.csv data
• Use a Color Manager to color by Product
• Plot BasketValue against BasketSize using Scatter Plot
• Compare the sum of BasketValue by time and product in a Line Plot and a Stacked Area Chart (use the Pivoting node to get the sum of sales per Quarter and Product!)

• Read Fully Joined Data
• Show the number of customers in the web activity categories using Bar Chart
• Show the age distribution of customers using Histogram
• Create a composite view by combining the Bar Chart and Histogram
• Select one web activity class in the Bar Chart. Which age classes are represented in the selected web activity class?

Today’s Example

Data Mining
Partition, Learn, Predict, Score

Data Mining Strategies

Example Applications:
• Anomaly Detection (fraud, predictive maintenance)
• Association Rule Learning (market basket analysis)
• Clustering (market segmentation)
• Classification (next best offer, churn prevention)
• Regression (trend estimation)

Data Mining: Process Overview

[Figure: Original Data Set → Training Set and Test Set; Training Set → Train Model; Test Set → Apply Model → Score Model. Stage captions: Partition data; Train and apply models; Evaluate performance]
Data Mining in KNIME

• KNIME has many modeling tools!
− Decision tree, random forest, SVM, regression, neural networks, clustering, …
− and integrations with other libraries: R, Python, H2O, WEKA, libSVM, etc.
• And many model evaluation nodes
− ROC, standard, numeric and entropy scorers
− Feature elimination
− Cross validation

New Node: Partitioning

• Use to split data into training and evaluation sets
− Partition by count (e.g. 10 rows) or fraction (e.g. 10%)
− Sample by a variety of methods: random, linear, stratified
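
As a rough illustration of what stratified sampling by fraction means, the sketch below splits a toy table so that each class keeps about the same share in both partitions. It is not the Partitioning node's implementation. The function name `stratified_partition`, its parameters, and the toy data are assumptions made for this example.

```python
import random
from collections import defaultdict

def stratified_partition(rows, label_of, fraction, seed=0):
    """Split rows into two lists, keeping each class's share roughly equal in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    first, second = [], []
    for members in by_class.values():
        members = members[:]          # copy so the input order is left untouched
        rng.shuffle(members)
        cut = round(len(members) * fraction)
        first.extend(members[:cut])
        second.extend(members[cut:])
    return first, second

# Toy data: 40 rows, 10 of class "yes" and 30 of class "no"
data = [(i, "yes" if i % 4 == 0 else "no") for i in range(40)]
train, test = stratified_partition(data, label_of=lambda r: r[1], fraction=0.5)
print(len(train), len(test))
```

With a plain random split the rare class could end up under-represented in one partition; splitting per class avoids that.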

Learner-Predictor Motif

• Most data mining approaches in KNIME use a Learner-Predictor motif.
• The Learner node trains the model with its input data.
• The Predictor node applies the model to a different subset of data.

Figure labels: Trained Model; New Data!

Classification

Predict nominal outcomes on existing data (supervised)


• Applications
− Churn analysis (yes/no)
− Chemical activity (active/inactive)
− Spam detection (spam/not spam)
− Optical character recognition (A-Z)

• Methods
− Decision Trees
− Neural Networks
− Naïve Bayes
− Logistic Regression
Target Column

• Target column contains values that are predicted by the classification model
• Binomial target values are often encoded as 1 and 0

Application                    Target Column  Target Values
Churn analysis                 Churn          Yes/No or 1/0
Chemical activity              Active         Yes/No or 1/0
Spam detection                 Spam           Yes/No or 1/0
Optical character recognition  Character      A-Z

KNIME’s Decision Tree

J.R. Quinlan, “C4.5: Programs for Machine Learning”
J. Shafer, R. Agrawal, M. Mehta, “SPRINT: A Scalable Parallel Classifier for Data Mining”

• C4.5 builds a tree from a set of training data using the concept of information entropy.
• At each node of the tree, the attribute of the data with the highest normalized information gain (difference in entropy) is chosen to split the data.
• The C4.5 algorithm then recurses on the smaller sublists.
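
To make "normalized information gain" concrete, here is a minimal sketch that computes entropy and the gain ratio for nominal attributes. It only illustrates the split criterion, not KNIME's Decision Tree Learner. The toy rows and the column names `contracts`, `region`, and `churn` are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain_ratio(rows, attribute, target):
    """Information gain of splitting on `attribute`, normalized by the split's own entropy."""
    base = entropy([r[target] for r in rows])
    partitions = {}
    for r in rows:
        partitions.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    split_info = entropy([r[attribute] for r in rows])
    gain = base - remainder
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical toy data: `contracts` separates the classes, `region` does not
rows = [
    {"contracts": "one", "region": "north", "churn": "yes"},
    {"contracts": "one", "region": "south", "churn": "yes"},
    {"contracts": "many", "region": "north", "churn": "no"},
    {"contracts": "many", "region": "south", "churn": "no"},
]
best = max(["contracts", "region"], key=lambda a: gain_ratio(rows, a, "churn"))
print(best)
```

The attribute with the highest ratio would be used for the split, and the same procedure would then be repeated on each resulting subset.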

New Node: Decision Tree Learner

Decision Tree View

Most of the people who don’t churn have more than one contract

New Node: Decision Tree Predictor

• Takes a decision tree model and applies it to new data
• Check the box to append class probabilities

New Node: Scorer

Compare predicted results to known truth in order to evaluate model quality

New Node: Scorer
The confusion matrix shows the distribution of model errors

An accuracy statistics table provides a detailed analysis of model quality
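
The following sketch shows how a confusion matrix and a few accuracy statistics can be derived from actual and predicted class values. It illustrates the idea only and makes no claim about the exact columns the Scorer node outputs. The label lists are invented for this example.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))

def accuracy(matrix):
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / sum(matrix.values())

def recall(matrix, cls):
    actual_cls = sum(n for (a, _), n in matrix.items() if a == cls)
    return matrix[(cls, cls)] / actual_cls

def precision(matrix, cls):
    predicted_cls = sum(n for (_, p), n in matrix.items() if p == cls)
    return matrix[(cls, cls)] / predicted_cls

# Invented example: 1 = positive class, 0 = negative class
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 0, 1, 0, 1]
matrix = confusion_matrix(actual, predicted)
print(matrix[(1, 1)], matrix[(1, 0)], matrix[(0, 1)], matrix[(0, 0)])
print(accuracy(matrix), recall(matrix, 1), precision(matrix, 1))
```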

Confusion Matrix

Receiver Operating Characteristics

• Sort by confidence in target class
• Plot true positive rate (TPR) vs. false positive rate (FPR)
• Ideal models achieve 100% TPR with 0% FPR
• Area under the curve (AUC) indicates model quality
− (1 = ideal model, 0.5 = random outcome)
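
The steps above can be sketched as follows: rank the rows by the predicted probability of the positive class, accumulate TPR and FPR row by row, and integrate the curve with the trapezoidal rule. This is a simplified illustration that does not treat tied scores specially. The function names and the example scores are assumptions.

```python
def roc_points(labels, scores, positive=1):
    """Sort by descending confidence and record (FPR, TPR) after each row."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    positives = sum(1 for label in labels if label == positive)
    negatives = len(labels) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Trapezoidal area under the (FPR, TPR) curve."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Invented example: one negative row is ranked above one positive row
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
print(round(auc(curve), 4))

# A model that ranks every positive above every negative
perfect = roc_points([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
print(auc(perfect))
```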

New Node: ROC Curve

• Requires individual class probabilities from a preceding predictor
• User must define:
1. Original class column
2. Positive class value
3. Probability for the selected positive class value for one or multiple models

Data Mining Exercise, Activity I
Start with exercise: Data Mining, Activity I
• Partition the fully joined data
− 50%, Stratified Sampling
• Train a decision tree on the training data
− Predict the Target column
• Use the model to predict the upsell potential (Target = 1 or Target = 0) for the remaining records
• Evaluate the quality of the model with a Scorer
• Optional: Find the AUC for the model using the ROC Curve node

Regression

Predict numeric outcomes on existing data (supervised)


• Applications
− Forecasting
− Quantitative Analysis

• Methods
− Linear
− Polynomial
− Regression Trees
− Partial Least Squares

New Nodes: Linear Regression Learner & Regression Predictor

A linear model relating a dependent variable to one or more independent variables
− Model coefficients provided in 2nd output port
− Also available: Polynomial and Tree Ensemble Regression nodes
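
For the simplest case of one independent variable, the model coefficients can be computed in closed form by ordinary least squares. The sketch below illustrates that calculation on invented numbers. It is not the Linear Regression Learner's implementation, and `fit_line` is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Invented data that lies exactly on y = 2x + 1
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
prediction = slope * 5 + intercept   # apply the fitted model to a new value
print(slope, intercept, prediction)
```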

New Node: Numeric Scorer

Similar to the Scorer node, but for nodes with numeric predictions (e.g. linear/polynomial regression)
• Compare dependent variable values to predicted values to evaluate goodness of fit.
• Report R², RMSD, SEM etc.
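
A few common goodness-of-fit measures can be computed directly from the actual and predicted values, as sketched below. The selection of metrics (R², root mean squared error, mean absolute error) and the numbers are illustrative assumptions, not a listing of the Numeric Scorer's exact output.

```python
import math

def numeric_scores(actual, predicted):
    """Compare actual and predicted values with three standard error measures."""
    n = len(actual)
    mean_actual = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))   # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)            # total sum of squares
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(a - p) for a, p in zip(actual, predicted)) / n,
    }

# Invented example values
actual = [10, 12, 14, 16]
predicted = [11, 12, 13, 17]
scores = numeric_scores(actual, predicted)
print(scores)
```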

Data Mining Exercise, Activity II

Start with exercise: Data Mining, Activity II

• Read the weather.table file
• Split the data into 2016 for training and 2017 as test data
• Train a linear regression model that predicts AIR_TEMP as a function of all other parameters in the data set
• Use the model to predict the temperature in 2017 and evaluate it with the Numeric Scorer
• Optional: Calculate the mean temperature per month on the training data
− Join the mean temperature to the test data set (2017)
− Use the Numeric Scorer to see if this simplest model is better than the Linear Regression

Today’s Example

Clustering

Discover hidden structure in unlabeled data (unsupervised)

• Applications
− Market Segmentation
− Diversity picking
• Methods
− k-means/k-medoids
− Hierarchical
− DBSCAN
− OPTICS
− Neighbourgrams

New Nodes: k-Means Clustering

• Looks at n observations to define the means for k clusters.
• Each observation is then assigned to its closest cluster center.
• You must provide k.
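
A minimal sketch of the k-means idea for 2-D points: alternate between assigning each point to its nearest center and moving each center to the mean of its points. The fixed number of iterations, the random initialization, and the toy points are assumptions made for this illustration; the k-Means node may differ in these details.

```python
import random

def k_means(points, k, iterations=20, seed=0):
    """Lloyd-style k-means for 2-D points with squared Euclidean distance."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # start from k randomly chosen points
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x, y in points:
            nearest = min(
                range(k),
                key=lambda i: (x - centers[i][0]) ** 2 + (y - centers[i][1]) ** 2,
            )
            clusters[nearest].append((x, y))
        # Move each center to the mean of its cluster; keep it if the cluster is empty
        centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Two well-separated toy groups
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = k_means(points, k=2)
print(sorted(centers))
```

Because k is an input, a visual check of the data (as suggested in the exercise that follows) is one way to judge whether the chosen k is reasonable.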

Data Mining Exercise, Activity III

Start with exercise: Data Mining, Activity III


• Read the location_data.table file
• Filter to entries from California (region_code = CA)
• Train a k-means model with k=3. Use only position data for clustering (latitude and longitude)
• Optional: Plot latitude and longitude in a view (OSM Map or Scatter Plot) and use that to help you visually optimize k.

Model Optimization

KNIME’s Tree Ensemble Models

• The general idea is to take advantage of the “wisdom of the crowd”
• Ensemble models: Combining predictions from a large number of weak predictors, e.g. decision trees
• Leads to a more accurate and robust model
• This is called “bagging”

Typically: for classification the individual models vote and the majority wins; for regression, the individual predictions are averaged

[Figure: input X is passed to n decision trees, whose predictions P1, P2, …, Pn are combined into the output y]

How Does Bagging Work?

• Pick a different random subset of the training data for each model in the ensemble (bag)

[Figure: a separate tree is built from each random subset]

An Extra Benefit of Bagging: Out of Bag Estimation

• Allows testing the model using the training data: when validating, each
model should only vote on data points that were not used to train it

[Figure: out-of-bag outputs y1OOB and y2OOB for data points X1 and X2, each combined from the tree predictions P1, P2, …, Pn]
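
Bagging and out-of-bag estimation can be sketched with a deliberately weak learner. In the example below each model is a one-threshold "stump" trained on its own bootstrap sample, and each row is scored only by the stumps that never saw it. Sampling with replacement, the stump learner, and the toy data are assumptions for this illustration and not a description of KNIME's tree ensemble nodes.

```python
import random
from collections import Counter

def train_stump(sample):
    """Hypothetical weak learner: pick the threshold t that best fits 'x > t means class 1'."""
    best_errors, best_threshold = None, None
    for threshold, _ in sample:
        errors = sum((x > threshold) != (y == 1) for x, y in sample)
        if best_errors is None or errors < best_errors:
            best_errors, best_threshold = errors, threshold
    return best_threshold

def bagged_stumps(data, n_models, seed=0):
    """Train each stump on its own bootstrap sample and remember which rows it saw."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        picked = [rng.randrange(len(data)) for _ in data]   # sampling with replacement
        models.append((train_stump([data[i] for i in picked]), set(picked)))
    return models

def vote(thresholds, x):
    """Majority vote of the given stumps for input x."""
    return Counter(int(x > t) for t in thresholds).most_common(1)[0][0]

def oob_accuracy(models, data):
    """Score each row using only the stumps that did not train on it."""
    correct = scored = 0
    for i, (x, y) in enumerate(data):
        unseen = [t for t, used in models if i not in used]
        if unseen:
            scored += 1
            correct += vote(unseen, x) == y
    return correct / scored

data = [(x, int(x >= 10)) for x in range(20)]   # toy 1-D data, class 1 when x >= 10
models = bagged_stumps(data, n_models=25)
print(vote([t for t, _ in models], 15), oob_accuracy(models, data))
```

The out-of-bag accuracy acts as a built-in validation score, since no row is ever judged by a model that was trained on it.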

Random Forest

• Bag of decision trees, with an extra element of randomization when building the trees: each node in the decision tree only “sees” a subset of the input columns, typically √N
• Random forests tend to be very robust w.r.t. overfitting (though the individual trees are almost certainly overfit)
• Extra benefit: training tends to be much faster

[Figure: Build tree]
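
The extra randomization step can be illustrated as drawing a fresh random subset of columns every time a tree node looks for a split. The sketch assumes the common default of using the square root of the number of input columns. The column names and the helper `candidate_columns` are hypothetical.

```python
import math
import random

def candidate_columns(columns, rng):
    """Pick the random column subset that one tree node is allowed to consider."""
    k = max(1, round(math.sqrt(len(columns))))   # assumed default: sqrt of the column count
    return rng.sample(columns, k)

columns = ["age", "income", "region", "contracts", "tenure", "visits", "basket", "gender", "plan"]
rng = random.Random(42)
subset = candidate_columns(columns, rng)
print(subset)
```

Evaluating only a few columns per node is also why training tends to be faster than for a single tree that checks every column at every split.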

New Nodes: Random Forest Learner

• The output model describes a random forest and is applied in the corresponding predictor node using a simple majority vote
• The statistics table on the attributes tells how often each attribute…
– … is used in the first three splits
– … was a possible candidate in the first three splits

Tree Ensembles

• Random Forest variant
• More options to set
• Trees may be trained using subsets of rows and/or columns, and this approach may lead to greater accuracy

New Nodes: Tree Ensemble Learner/Predictor

• Choose which columns to include
• Configure a prototype tree (depth, split criteria etc.)
• Set up ensemble parameters (model count, row/column subsampling)

Advanced Data Mining Exercise, Activity I

Start with exercise: Advanced Data Mining, Activity I


• Read the data file CurrentDetailData.table
• Partition the data 50/50 using stratified sampling on the Products column
• Create a Tree Ensemble model to predict the “Products” column
− Use a tree depth of 5, 50 models, and 75% of rows and columns for each
iteration.

Parameter Optimization

• Some modeling approaches are very sensitive to their configuration


• Calculating optimum settings is not always possible
• Parameter Optimization loops may help find a good configuration

The Loop Block

• A loop block is defined by appropriate loop start and loop end nodes
• Loop body = Nodes in between and side branches

Figure labels: Loop start node; Loop body; Loop end node

New Node: Parameter Optimization Loop Start

• Define some parameters to optimize
• Set upper/lower bounds and step sizes (and flag integers)
• Choose an optimization method
− Brute force for maximum accuracy but slower computation
− Hill climbing for faster runtimes, but it may get stuck in locally optimal settings
− Random search to randomly search for parameter values within a given range
− Bayesian Optimization (TPE)
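
The trade-off between brute force and hill climbing can be seen on a small integer grid. The sketch below uses an invented score function with a local peak at 40 and a global peak at 150. The helpers `brute_force` and `hill_climb` are simplified illustrations and not the loop nodes' actual search code.

```python
def toy_accuracy(n):
    """Invented score: local peak at n=40 (0.80), global peak at n=150 (0.90)."""
    return max(0.80 - abs(n - 40) / 200, 0.90 - abs(n - 150) / 200)

def brute_force(score, lower, upper, step):
    """Evaluate every grid value and keep the best one."""
    return max(range(lower, upper + 1, step), key=score)

def hill_climb(score, lower, upper, step, start):
    """Move to the better neighbor until no neighbor improves the score."""
    current = start
    while True:
        neighbors = [v for v in (current - step, current + step) if lower <= v <= upper]
        best = max(neighbors, key=score)
        if score(best) <= score(current):
            return current
        current = best

best_overall = brute_force(toy_accuracy, lower=10, upper=200, step=10)
stuck = hill_climb(toy_accuracy, lower=10, upper=200, step=10, start=10)
lucky = hill_climb(toy_accuracy, lower=10, upper=200, step=10, start=120)
print(best_overall, stuck, lucky)
```

Brute force evaluates all 20 grid values, while hill climbing evaluates far fewer but ends at whichever peak is closest to its starting point.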
New Node: Parameter Optimization Loop End

• Collects some value to optimize in a flow variable.
• Value may be maximized (accuracy) or minimized (error)

Loop Start and End Nodes

• For different tasks, there are different loop start and end nodes
• Nodes with circular arrows (green) = Start node
• Nodes with a closed circle (red) = End node
• Flow Variables are really helpful to build the loop body

Flow Variables

What is a Flow Variable?


• Flow Variables are workflow parameters
• Flow Variables are used to overwrite existing node settings
• Flow Variables can be of type String, Integer, or Double

How can we create a Flow Variable?


• Special nodes have Flow Variables as output
– Parameter Optimization Loop Start node
– Table Row to Variable node
• In the “Flow Variables” tab of any node

Flow Variable Ports

Apply a Flow Variable (Button)

The Flow Variable button

Apply a Flow Variable (Advanced)

Figure labels: The Flow Variable tab; List of available Flow Variables

Create a Flow Variable (Button)

Name of the new Flow Variable

Create a Flow Variable (Advanced)

Converting a setting value into a Flow Variable

Name of the new variable

Advanced Data Mining Exercise, Activity II

Start with exercise: Advanced Data Mining, Activity II


• Read the data file CurrentDetailData.table
• Partition the data 50/50 using stratified sampling on the column Products
• Build a parameter optimization loop for a Tree Ensemble model to predict the Products column
• Use Hill climbing to determine the optimum number of models (min=10, max=200, step=10, int = yes)
• Maximize the accuracy in the Loop End node
• What were the optimal settings?

(Hint: don’t forget to use the Flow Variable in your learner node)
Integrating External Tools

Goal of This Session

• This session gives a quick overview of the external tools that can be called
within KNIME, e.g.:
− Java, R, Python
− Web services

KNIME Labs

• KNIME Labs enables you to preview new KNIME features and plug-ins that are still under development
• The nodes provided in KNIME Labs are not (yet) part of the official KNIME version because the functionality and/or API may not be finalized
• You can get these plug-ins by installing the extension from the KNIME Labs extensions category

Java Snippet

• Fastest running scripting node in KNIME
• Syntax highlighting, auto completion, error checking
• Templates allow you to save scripts for later re-use
• Import custom libraries

Java Edit Variable

• Same as Java Snippet, but with flow variable ports

R Integration

• Run any R code from KNIME
• Works with existing R installations
• Nodes for many tasks
• First run: install.packages('Rserve') and install.packages('Cairo')*
*mac only

Figure labels: Train model; Apply model; Run any R code; Pass one workspace to multiple nodes; Create plot

R Integration

Figure labels: Syntax highlighting; Create and store templates; R workspace; Show results; Evaluate script; R console output

Python Integration

• Run Python inside KNIME


• Works with existing installations
• UI modeled after R integration

Python Scripting UI

RESTful Web Services

• Use KNIME nodes to interact with RESTful web services
• Send requests using standard HTTP methods

JSON Response:

XML Response:

RESTful Web Services
Figure labels: Enter URL, or use from column; Provide authentication if necessary; Add delay between individual requests

https://www.knime.com/blog/a-restful-way-to-find-and-retrieve-data
https://www.knime.com/blog/OSM-meets-CSV-file-and-Google-API

KNIME Server as a REST Resource

https://www.knime.org/blog/giving-the-knime-server-a-rest
KNIME Server as a REST resource

• Use cURL, SoapUI or the Chrome extension Postman to explore the REST API

JSON and JSON Path

• Use the JSON Reader (or the GET Resource) nodes to get a JSON cell
• Use JSONPath nodes to query the JSON and extract certain parameters
• Editor window simplifies construction of JSON queries by auto-generating them (click on properties)
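
To show what a JSONPath query extracts, the sketch below walks a small JSON document in plain Python. The standard library has no JSONPath engine, so the list comprehensions are only rough equivalents of the queries named in the comments. The document structure and field names are invented for this example.

```python
import json

# Invented JSON document
document = json.loads("""
{"books": [
  {"id": "b1", "title": "First Title", "authors": ["A. Author"]},
  {"id": "b2", "title": "Second Title", "authors": ["B. Writer", "C. Coauthor"]}
]}
""")

# Rough equivalent of the JSONPath query $.books[*].title, collected as a list
titles = [book["title"] for book in document["books"]]

# Rough equivalent of $.books[*].authors[*] followed by ungrouping: one row per author
author_rows = [(book["id"], author) for book in document["books"] for author in book["authors"]]

print(titles)
print(author_rows)
```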

JSON Path

XML and XPath

• Use the XML Reader (or the GET Resource) nodes to get
an XML cell
• Use XPath nodes to query the XML and extract certain
parameters
• Editor window simplifies construction of XPath queries by
auto-generating them (click on XML elements)
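
The same kind of query can be tried on XML with Python's `xml.etree.ElementTree`, which supports a limited subset of XPath. The XML snippet and element names below are invented, and the example only illustrates the query idea, not the XPath node itself.

```python
import xml.etree.ElementTree as ET

# Invented XML document
root = ET.fromstring("""
<library>
  <book year="2001"><title>First Title</title></book>
  <book year="2015"><title>Second Title</title></book>
</library>
""")

# Select the title of every book
titles = [node.text for node in root.findall("./book/title")]

# Select only books whose year attribute equals 2015
recent = [book.find("title").text for book in root.findall("./book[@year='2015']")]

print(titles, recent)
```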

XPath

Remote File Handling – Cloud Storage

• Integrate remote data sources from Amazon AWS and Microsoft Azure
− Upload files
− Download files, or read their content directly into KNIME
− List files in remote directories
− Create directories
− Delete files / directories

Remote File Handling – Cloud Storage

Example: Upload all files from a local directory to Amazon S3

Workflow figure callouts:
− Enter credentials
− Create bucket
− Create new directory in bucket
− Create URIs of local file paths
− Upload!
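
The "create URIs of local file paths" step can be sketched outside KNIME with the Python standard library. This is only an illustration of that single step: the helper name `local_file_uris` and the file names are assumptions, and the upload to S3 itself is not shown.

```python
import tempfile
from pathlib import Path

def local_file_uris(directory):
    """Return a sorted list of file:// URIs for the files directly in `directory`."""
    root = Path(directory).resolve()  # as_uri() requires an absolute path
    return sorted(p.as_uri() for p in root.iterdir() if p.is_file())

# Demonstration on a throwaway directory with two hypothetical files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.csv", "b.csv"):
        (Path(tmp) / name).write_text("x,y\n1,2\n")
    uris = local_file_uris(tmp)

print(uris)
```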

Google Sheets

• Access your data stored in Google Services
− Read data from Google Sheets
− Write data to new sheets
− Modify existing sheets

• Makes collaboration and sharing of data easy
− (especially vs. sending Excel sheets via email...)

Authenticate via pop-up window (OAuth2)

Google Sheets

• Select from available sheets on Google Drive
• Transform data in KNIME, or enrich with new data
• Create new sheet or update existing sheets
− Allows reading from / writing to a specific range of a sheet (e.g. A1:G10)

Node dialog callouts:
− Authenticate via pop-up window (OAuth2)
− Select from available sheets, open in browser for preview
− Specify target sheet, select which columns to write, etc.
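
To make the range notation concrete, here is a small sketch that converts an A1-style range such as A1:G10 into zero-based row and column bounds. The helper names `column_index` and `parse_a1_range` are hypothetical and not part of KNIME or any Google API.

```python
import re

def column_index(letters):
    """Convert column letters (A, B, ..., Z, AA, ...) to a zero-based index."""
    index = 0
    for char in letters.upper():
        index = index * 26 + (ord(char) - ord("A") + 1)
    return index - 1

def parse_a1_range(a1_range):
    """Parse a range such as 'A1:G10' into zero-based, inclusive bounds."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+):([A-Za-z]+)(\d+)", a1_range)
    if match is None:
        raise ValueError(f"not an A1-style range: {a1_range!r}")
    first_col, first_row, last_col, last_row = match.groups()
    return {
        "first_row": int(first_row) - 1,
        "last_row": int(last_row) - 1,
        "first_col": column_index(first_col),
        "last_col": column_index(last_col),
    }

bounds = parse_a1_range("A1:G10")
print(bounds)
```

For A1:G10 this covers 10 rows and 7 columns.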

Exercises (Optional)

Start with exercise: Integrating External Tools

• Use the GET Request node to call an external web service
− https://raw.githubusercontent.com/tamingtext/book/master/apache-solr/example/exampledocs/books.json
• Use the JSON Path node to output subparts of the JSON query
– Extract all possible subparts
– Output the subparts as a list
• Use the Ungroup node to split the single JSON queries into separate rows
• Use the JSON Path node to separate “name”, “author”, and “price” from the JSON queries
– Hint: Right-click on a value in the JSON query and select “Add JSONPath”

Exporting Data & Deployment

Exporting Data

After an analysis is completed, what next?

− Write results to a file
− Create/update a database
− Save the model for use elsewhere
− Generate a rich report
− Deploy via KNIME WebPortal
− Deploy the workflow as a RESTful web service

Input/Output in Deployment

Input
• File (CSV, Table, XLS, …)
• Database
• JSON for REST API

Output
• Report (BIRT, Tableau, Spotfire)
• Email
• File (CSV, Table, XLS, …)
• WebPortal

To Report / Email

To BIRT Report

Also available:
Nodes for Tableau
and Spotfire

To File / Database

REST API (Available on KNIME Server)

To Dashboard on WebPortal

Step 1: Upload File
Step 2: Select Columns
Step 3: Customize Column Domains
Step 4: Interactive View
Step 5: Download Image

Workflow on KNIME WebPortal

WebPortal Page (Step 1): Upload File
WebPortal Page (Step 4): Interactive View

Available in KNIME Server

Components to Produce Dashboard on Web Page

Figure labels:
− File Selection
− Column Selection
− Row Filter
− Filter by Range
− Stacked Area Chart

Data Export Nodes

Typically characterized by:
− Magenta color
− 1 input port, no output ports
− Create file on file system or write to database
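
As a rough analogue of what an export node does, the following sketch writes the same made-up result rows to a CSV target and to a database. An in-memory buffer and an in-memory SQLite database stand in for a real file and a real database; the table and column names are assumptions.

```python
import csv
import io
import sqlite3

# Made-up result rows, used only for illustration.
rows = [("setosa", 0.98), ("versicolor", 0.91)]
header = ("label", "accuracy")

# 1) Write to a file-like target (an in-memory buffer stands in for a file).
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
csv_text = buffer.getvalue()

# 2) Write to a database (in-memory SQLite stands in for a real database).
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE results (label TEXT, accuracy REAL)")
connection.executemany("INSERT INTO results VALUES (?, ?)", rows)
connection.commit()
stored = connection.execute("SELECT COUNT(*) FROM results").fetchone()[0]
connection.close()

print(csv_text)
print(stored)
```

Like an export node, both steps consume a table and produce no table of their own; the effect is the file or the database rows.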

New Node: Table Writer

New Node: XLS Writer

New Node: Database Writer

Automation: Call Local Workflow

• Use Call Local Workflow node to send data and parameters to other workflows and trigger execution
− Send results back to caller-workflow
− Include report from called workflow
• Create modular workflows
− E.g. separate workflows for ETL and prediction
• Alternative: Call Remote Workflow
− Trigger execution of workflows on KNIME Server via REST API

Node dialog callouts:
− Path to workflow
− Add report to output
− Click to query the expected input(s)
− Specify source column(s) with input data / parameters

Automation: Call Local Workflow

Workflow figure callouts (ETL and Prediction workflows):
− Convert output data to KNIME table
− Calls workflow once for each input row
− Send results back to caller-workflow
− Enter default format for incoming data
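
The calling pattern (one call per input row, results returned to the caller, separate ETL and prediction steps) can be sketched in plain Python. This is an analogy only, not KNIME's API: the functions `etl_workflow`, `prediction_workflow`, and `call_workflow` are hypothetical stand-ins, and the data and the segment rule are made up.

```python
def etl_workflow(row):
    """Stand-in for a called ETL workflow: clean one input row."""
    return {"customer": row["customer"].strip().lower(), "spend": float(row["spend"])}

def prediction_workflow(row):
    """Stand-in for a called prediction workflow (a hypothetical fixed rule)."""
    return {**row, "segment": "high" if row["spend"] >= 100.0 else "low"}

def call_workflow(workflow, table):
    """Call `workflow` once per input row and collect the results as a table."""
    return [workflow(row) for row in table]

# Made-up input table, used only for illustration.
table = [
    {"customer": " Alice ", "spend": "120.5"},
    {"customer": "BOB", "spend": "40"},
]

cleaned = call_workflow(etl_workflow, table)
predicted = call_workflow(prediction_workflow, cleaned)
print(predicted)
```

Keeping the ETL and prediction steps separate means either one can be replaced or reused without touching the caller.
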
Use Call Local Workflow to Send Conditional Emails with Report

Sometimes a report should be sent only under specific circumstances
• E.g. if some KPI is below a threshold

Workflow figure callouts:
− Workflow creates report, sends back binary column
− Define rule, only send email if conditions apply
− Convert binary column to file and save to temp dir
− Path to file with report
− Provide email credentials, host, etc.
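
The rule "only send the email if the condition applies" can be sketched as follows. This is a minimal illustration under stated assumptions: the threshold value, the addresses, the file name, and the helper `build_alert` are all hypothetical, and the message is only built, not sent.

```python
from email.message import EmailMessage

KPI_THRESHOLD = 0.8  # hypothetical threshold

def build_alert(kpi_value, report_bytes, sender, recipient):
    """Return an EmailMessage with the report attached, or None if no alert is due."""
    if kpi_value >= KPI_THRESHOLD:
        return None  # rule not met: no email
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = f"KPI below threshold: {kpi_value:.2f}"
    message.set_content("The attached report was generated automatically.")
    message.add_attachment(
        report_bytes,
        maintype="application",
        subtype="pdf",
        filename="report.pdf",
    )
    return message

# Placeholder bytes stand in for the binary report returned by the called workflow.
fake_report = b"%PDF-placeholder"
alert = build_alert(0.65, fake_report, "knime@example.com", "team@example.com")
no_alert = build_alert(0.92, fake_report, "knime@example.com", "team@example.com")
print(alert["Subject"] if alert else None, no_alert)
```

Actually sending the message would additionally need the mail host and credentials, for example via `smtplib`, which is left out here.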

Reporting in KNIME

Reporting in KNIME

• Reporting in KNIME is done via a third-party application named BIRT (Business Intelligence and Reporting Tools)
• Data is sent to BIRT from KNIME using special nodes
• Reports in BIRT are constructed from report items, which may include images, tables, charts, and labels
• Reports may be generated in a variety of formats (html, pdf, pptx, xlsx, docx, …)

Installation

• Can be installed via KNIME -> Install KNIME Extension
• Install the KNIME Report Designer

New Node: Data to Report

Send a data table to BIRT

Set the node label!

Hint: The node label will be used to identify the data source in the reporting view -> Make sure to use understandable labels if you have more than one data source

New Node: Image to Report

Send an image to BIRT
− PNG and SVG are supported formats (see node description for details)

Hint: Customize the image size in the Data to Report node to fit the report

Edit the Report

Open the workflow and click the Report Editor button in the tool bar

Reporting Perspective

Screenshot callouts:
− Click button to create report
− Data from KNIME: names of data sources are taken from node label
− Report layout: only structure, data is filled in when creating the report
− View tabs
− Add report items via drag and drop

Charting in BIRT

• Many chart types
• Fine control of plot appearance
• Familiar ‘Excel-like’ interface
• Supports interactivity

Tips & Tricks

• Use an underlying grid to structure the report
• Names of columns should not change
• Use the grouping function to combine results
• Use the Master Layout tab (for footers etc.)

Exporting Data Exercise

Start with exercise: Exporting Data

• Send a heatmap that shows the confusion matrix values to the report via the Image to Report node
• Send a table that shows the model accuracy statistics to the report via the Data to Report node
• Create a report that includes the following elements:
– A report title
– A table with the overall accuracy
– The heatmap image
• Generate a PDF of your report

Today’s Example

The End
education@knime.com

Copyright © 2019 KNIME AG
