Rosaria.Silipo@knime.com
cpearl@gmail.com
Kilian.Thiel@knime.com
Tool Installation
9/23/2014
KNIME 2.10
Text Processing Extension from KNIME Labs Extensions
Distance Matrix from KNIME Extensions
Memory Tip
In the knime.ini file, set the maximum heap size as high as your machine allows, for example:
-Xmx3G
Resources
The KNIME Website (www.knime.org)
LEARNING HUB under RESOURCES (www.knime.org/learning-hub)
Use Cases and White Papers for example workflows
KNIME TV channel on YouTube
Text Mining Webinar: http://www.youtube.com/watch?v=tY7vpTLYlIg
KNIME on Twitter: @KNIME
Copyright 2014 KNIME.com AG
Resources
eBooks from the KNIME Press:
http://www.knime.org/knimepress
- KNIME Beginner's Luck guide (free with code meetupsf14)
- The KNIME Cookbook
- The KNIME Booklet for SAS Users
[Figure: text processing pipeline diagram. Recoverable labels: 2. Enrichment (Tagging); 4. Transformation (BoW, Frequencies, Document Vector); Document Type Cell; Term Type Cell]
Demo Workflows
Italian or Chinese Restaurant?
Goal:
Build a classifier to distinguish between Chinese and
Italian restaurants, based on their reviews.
1.) Reading
Demo
Reading
Read Tripadvisor data (.table file)
Filter rows with missing restaurant value
Convert strings to documents
Filter all but the document column
Examples of other possible formats to import
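The reading steps above can be sketched outside KNIME in a few lines of plain Python. This is only an illustration under stated assumptions: the `Document` class, the column names `restaurant` and `review`, and the sample rows are hypothetical stand-ins, not the actual TripAdvisor table or the KNIME node behaviour.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical stand-in for a KNIME Document cell."""
    text: str
    category: str


# Hypothetical rows standing in for the TripAdvisor table.
rows = [
    {"restaurant": "Italian", "review": "Great pasta and friendly staff."},
    {"restaurant": None, "review": "No restaurant value recorded."},
    {"restaurant": "Chinese", "review": "The dumplings were excellent."},
]


def rows_to_documents(rows):
    """Drop rows with a missing restaurant value, then keep only documents."""
    kept = [r for r in rows if r.get("restaurant")]
    return [Document(text=r["review"], category=r["restaurant"]) for r in kept]


documents = rows_to_documents(rows)
for doc in documents:
    print(doc.category, "|", doc.text)
```

Returning only `Document` objects mirrors the last step of the slide, where every column except the document column is filtered out.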
Demo
Reading
Web Crawler Workflow to get data from the Web
Palladian Community Contributions Extension
HtmlParser node
Xpath node
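To illustrate what the XPath step does after a page has been parsed, here is a minimal sketch using Python's standard library. The markup, the `review` class name, and the path expression are invented for the example; real pages are rarely well-formed XML, so a proper HTML parser would be needed first.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment; not taken from any real site.
page = """<html><body>
<div class="review"><p>Great pasta.</p></div>
<div class="ad"><p>Buy now!</p></div>
<div class="review"><p>Excellent dumplings.</p></div>
</body></html>"""

root = ET.fromstring(page)

# ElementTree supports a limited XPath subset, which is enough to select
# the paragraphs inside every div whose class attribute is "review".
reviews = [p.text for p in root.findall(".//div[@class='review']/p")]

for text in reviews:
    print(text)
```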
2.) Enrichment
Demo
Enrichment / Tagging
Apply POS Tagger node
Use Bag of Words node to inspect tagging result
Show other possible Taggings
3.) Preprocessing
Demo
Preprocessing
Filter:
- Numbers
- Punctuation marks
- Stop words
Convert to lower case
Stemming (Snowball stemmer, chosen because it supports many languages)
Keep only nouns (NN), verbs (VB), adjectives (JJ)
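The preprocessing chain above can be sketched on a list of already tagged terms. Everything here is an assumption made for illustration: the stop word list is tiny, `toy_stem` is a crude suffix stripper and not the Snowball algorithm, and tags are matched by prefix so that NNS or VBD count as nouns and verbs.

```python
import string

STOP_WORDS = {"the", "a", "and", "was", "were", "is", "of"}  # illustrative only
KEEP_TAG_PREFIXES = ("NN", "VB", "JJ")  # nouns, verbs, adjectives


def toy_stem(word):
    """Crude suffix stripping; a placeholder, NOT the Snowball stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(tagged_terms):
    """Apply the slide's filters to (word, POS tag) pairs."""
    result = []
    for word, tag in tagged_terms:
        if any(ch.isdigit() for ch in word):
            continue  # number filter
        if all(ch in string.punctuation for ch in word):
            continue  # punctuation filter
        if word.lower() in STOP_WORDS:
            continue  # stop word filter
        if not tag.startswith(KEEP_TAG_PREFIXES):
            continue  # keep only nouns, verbs, adjectives
        result.append((toy_stem(word.lower()), tag))
    return result


tagged = [
    ("The", "DT"), ("noodles", "NNS"), ("arrived", "VBD"), ("quickly", "RB"),
    (",", ","), ("2", "CD"), ("amazing", "JJ"), ("stars", "NNS"),
]
cleaned = preprocess(tagged)
print(cleaned)
```

In a real workflow the stemmer and the stop word list would come from the tool itself; the sketch only shows which terms survive each filter.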
4.) Transformation
Demo
Transformation
Transform to bag of words
Compute TF value for terms
TFrel(word) = n(word) / N, where N is the number of terms in the document
IDF(word) = log(1 + n(docs) / n(word, docs))
TFrel(word) * IDF(word) is often used
ICF(word) = log(1 + n(cat) / n(word, cat))
Sort output data by frequency
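A small worked example may help make the frequency formulas concrete. The sketch below assumes that N is the number of terms in the document, that n(word, docs) counts the documents containing the word, and that the logarithm is natural; the slide does not state the base, and the three toy documents are invented.

```python
import math
from collections import Counter


def tf_rel(term, doc_terms):
    """Relative term frequency: count of the term divided by document length."""
    return Counter(doc_terms)[term] / len(doc_terms)


def idf(term, docs):
    """IDF as on the slide: log(1 + n(docs) / n(word, docs)); natural log assumed."""
    containing = sum(1 for doc in docs if term in doc)
    if containing == 0:
        raise ValueError(f"term {term!r} occurs in no document")
    return math.log(1 + len(docs) / containing)


# Hypothetical preprocessed documents (lists of terms).
docs = [
    ["pasta", "pizza", "pasta"],
    ["noodle", "rice"],
    ["pizza", "wine"],
]

tf_pasta = tf_rel("pasta", docs[0])   # 2 occurrences out of 3 terms
idf_pasta = idf("pasta", docs)        # log(1 + 3/1)
idf_pizza = idf("pizza", docs)        # log(1 + 3/2)
score = tf_pasta * idf_pasta          # the combined TFrel * IDF weight

print(tf_pasta, idf_pasta, idf_pizza, score)
```

ICF follows the same pattern with categories in place of documents: count the categories and the categories in which the word appears.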
4.) Transformation
Demo
Transformation
Transform to document vectors
Extract category (class) value
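As a rough illustration of the document vector step, the sketch below turns each document into a binary vector over the vocabulary and pulls the category out as the class label. Binary presence values are an assumption here (frequencies could be used instead), and the toy documents are invented.

```python
def build_vocabulary(docs):
    """Sorted list of all distinct terms, one vector position per term."""
    return sorted({term for terms, _ in docs for term in terms})


def to_vector(terms, vocabulary):
    """1 if the vocabulary term occurs in the document, else 0."""
    present = set(terms)
    return [1 if term in present else 0 for term in vocabulary]


# Hypothetical (terms, category) pairs.
docs = [
    (["pasta", "pizza"], "Italian"),
    (["noodle", "rice"], "Chinese"),
    (["pizza", "wine"], "Italian"),
]

vocabulary = build_vocabulary(docs)
vectors = [to_vector(terms, vocabulary) for terms, _ in docs]
labels = [category for _, category in docs]

print(vocabulary)
for vector, label in zip(vectors, labels):
    print(vector, label)
```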
5.) Classification
Demo
Classification
Append color based on class
Partition data into training and test set
Train decision tree model on the training data
Apply decision tree model to the test data
Score model, measure accuracy
Show cross-validation loop
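The partition, train, apply, and score steps can be sketched in plain Python. To keep the example short it trains a decision stump, that is, a one-level decision tree on a single binary feature, rather than a full decision tree; the data, the 70/30 split, and the seed are all invented for illustration.

```python
import random
from collections import Counter


def partition(rows, train_fraction=0.7, seed=42):
    """Shuffle and split rows into a training set and a test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def majority(labels, default=None):
    """Most common label, or the default when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def train_stump(train):
    """Pick the feature whose 0/1 split misclassifies the fewest training rows."""
    overall = majority([label for _, label in train])
    best = None
    for i in range(len(train[0][0])):
        pred = {v: majority([l for x, l in train if x[i] == v], overall)
                for v in (0, 1)}
        errors = sum(1 for x, l in train if pred[x[i]] != l)
        if best is None or errors < best[0]:
            best = (errors, i, pred)
    return best[1], best[2]


def predict(model, vector):
    feature, pred = model
    return pred[vector[feature]]


def accuracy(model, rows):
    """Fraction of rows whose predicted class equals the true class."""
    return sum(1 for x, l in rows if predict(model, x) == l) / len(rows)


# Hypothetical document vectors over the terms [noodle, pasta, pizza, rice].
data = [
    ([0, 1, 1, 0], "Italian"), ([0, 1, 0, 0], "Italian"),
    ([0, 0, 1, 0], "Italian"), ([0, 1, 1, 0], "Italian"),
    ([0, 0, 1, 1], "Italian"),
    ([1, 0, 0, 1], "Chinese"), ([1, 0, 0, 0], "Chinese"),
    ([1, 0, 0, 1], "Chinese"), ([1, 1, 0, 0], "Chinese"),
    ([1, 0, 0, 1], "Chinese"),
]

train, test = partition(data)
model = train_stump(train)
test_accuracy = accuracy(model, test)
print("split feature index:", model[0], "test accuracy:", test_accuracy)
```

The toy data is deliberately separable on one term, so the stump is enough here; real review data would need a deeper tree and, as the slide suggests, cross-validation to get a reliable accuracy estimate.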
Additional Workflows
Thank You
Questions
http://tech.knime.org/forum
Rosaria.Silipo@knime.com
Follow us
Twitter: @KNIME
LinkedIn: https://www.linkedin.com/groups?gid=2212172
KNIME Blog: http://www.knime.org/blog