Competitions
Mark Peng
2015.12.16 @ Spark Taiwan User Group
https://tw.linkedin.com/in/markpeng
Kaggle Profile (Master Tier)
https://www.kaggle.com/markpeng
Things I Want to Share
Kaggle Competition Dataset and Rules
Testing Dataset
How to become a Kaggle Master
Reference: https://www.kaggle.com/wiki/UserRankingAndTierSystem
Recommended Data Science Process (IMHO)
[Figure: process timeline showing effort shifting from Exploratory Data Analysis to Final Predictions over relative time]
Cross Validation
The key to avoiding overfitting
Underfitting and Overfitting
We want to find the model with the lowest generalization error (hopefully)
[Figure: training error vs. testing error (private LB) plotted against model complexity, with underfitting on the left and overfitting on the right]
Big Shake-Up on the Private LB!
Reference: https://www.kaggle.com/c/restaurant-revenue-prediction/leaderboard/private
Who is the King of Overfitting?
- Public LB RMSE:
- Private LB RMSE:
- Num. of features: 41
- Training data size: 137
- Testing data size: 10,000
[Figure: training data / validation data split]
Should I Trust Public LB?
Yes, if you can find a K-fold CV split that follows the same trend as the public LB:
- High positive correlation between local CV and public LB scores
- The score increases and decreases in both local CV and public LB
Trust your local CV more!
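For instance, a minimal sketch of checking that trend in Python; the submission scores below are hypothetical placeholders:

```python
# Minimal sketch: correlate local K-fold CV scores with public LB scores
# across past submissions. The score values here are hypothetical placeholders.
import numpy as np
from scipy.stats import pearsonr

# One (local CV score, public LB score) pair per past submission.
local_cv = np.array([0.712, 0.725, 0.731, 0.740, 0.738])
public_lb = np.array([0.708, 0.719, 0.728, 0.735, 0.733])

corr, _ = pearsonr(local_cv, public_lb)
print(f"CV/LB correlation: {corr:.3f}")  # high positive value -> local CV is trustworthy
```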
Data Cleaning
Making data more analyzable
Recommended Data Science Process (IMHO)
[Figure: process timeline showing effort shifting from Exploratory Data Analysis to Final Predictions over relative time]
Data Cleaning Techniques
Commonly Used Models
What models to use?

Commonly Used ML Models

Model Type       | Name                             | R                      | Python
Hyperplane-based | Linear Regression                | glm, glmnet            | sklearn.linear_model.LinearRegression
Hyperplane-based | Naive Bayes                      | naiveBayes {e1071}     | sklearn.naive_bayes.GaussianNB, sklearn.naive_bayes.MultinomialNB, sklearn.naive_bayes.BernoulliNB
Hyperplane-based | Logistic Regression              | glm, glmnet, LiblineaR | sklearn.linear_model.LogisticRegression
Ensemble Trees   | Random Forests                   | randomForest           | sklearn.ensemble.RandomForestClassifier, sklearn.ensemble.RandomForestRegressor
Ensemble Trees   | Gradient Boosting Machines (GBM) | gbm, xgboost           | sklearn.ensemble.GradientBoostingClassifier, sklearn.ensemble.GradientBoostingRegressor, xgboost
Neural Network   | Multi-layer Neural Network       | nnet, neuralnet        | PyBrain, Theano
Commonly Used ML Models (cont.)

Model Type       | Name                             | Mahout      | Spark
Hyperplane-based | Linear Regression                |             | LinearRegressionWithSGD
Ensemble Trees   | Extremely Randomized Trees       |             |
Ensemble Trees   | Gradient Boosting Machines (GBM) |             | GradientBoostedTrees
Neural Network   | Multi-layer Neural Network       |             |
Recommendation   | Matrix Factorization             | parallelALS | ALS
Recommendation   | Factorization Machines           |             |
Clustering       | K-means                          | kmeans      | KMeans
Clustering       | t-SNE                            |             |
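As an illustration (not from the original deck's demos), a minimal sketch of training one model from the tables above, scikit-learn's GBM, scored with local 5-fold CV on synthetic data:

```python
# Minimal sketch: train a GBM from the table above on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=3)

# 5-fold CV AUC, in the spirit of "trust your local CV".
scores = cross_val_score(gbm, X, y, cv=5, scoring="roc_auc")
print(scores.mean())
```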
Feature Engineering
The key to success
Recommended Data Science Process (IMHO)
[Figure: process timeline showing effort shifting from Exploratory Data Analysis to Final Predictions over relative time]
Data Transformation
Feature Scaling
Rescaling
- Turn numeric features into the same scale (e.g., [-1, +1] or [0, 1])
- Python scikit-learn: MinMaxScaler
Standardization
- Removing the mean (μ = 0) and scaling to unit variance (σ = 1)
- Python scikit-learn: StandardScaler
- R: scale
Scaling avoids features in greater numeric ranges dominating those in smaller numeric ranges
- Critical for regularized linear models, KNN, SVM, K-Means, etc.
- Can make a big difference to the performance of some models
- Also speeds up the training of gradient descent
- But not necessary for tree-based models
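A minimal sketch of the two scaling approaches above on toy data:

```python
# Minimal sketch: the two scaling approaches named above, on toy data.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

X_rescaled = MinMaxScaler().fit_transform(X)        # rescaling to [0, 1]
X_standardized = StandardScaler().fit_transform(X)  # mean 0, unit variance

# Fit the scaler on training data only, then reuse it on the test data
# to avoid leakage.
```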
Variable Transformation
In the Liberty Mutual Group: Property Inspection Prediction competition
Feature Encoding
Label Encoding
- Interpret the categories as ordered integers (mostly wrong, since it imposes an artificial order)
- Python scikit-learn: LabelEncoder
One Hot Encoding
- Transform categories into individual binary (0 or 1) features
- Python scikit-learn: DictVectorizer, OneHotEncoder
- R: dummyVars in caret package
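A minimal sketch of both encodings on a toy categorical column (using pandas.get_dummies as one convenient one-hot route):

```python
# Minimal sketch: both encodings named above, on a toy categorical column.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.Series(["red", "green", "blue", "green"])

# Label encoding: categories become integers, implying an order that
# usually does not exist.
labels = LabelEncoder().fit_transform(colors)  # [2, 1, 0, 1]

# One hot encoding: one binary column per category.
onehot = pd.get_dummies(colors, prefix="color")
print(onehot)
```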
Feature Extraction: HTML Data
HTML files are often used as the data source for classification problems
- For instance, to identify whether a given HTML page is an ad or not
Possible features inside:
- HTML attributes (id, class, href, src, ...)
- Tag names
- Inner content
- JavaScript function calls, variables, ...
- Script comment text
Parsing tools:
- Jsoup (Java), BeautifulSoup (Python), etc.
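A minimal sketch of pulling these feature types out of raw HTML with BeautifulSoup; the HTML snippet is made up:

```python
# Minimal sketch: extracting the feature types listed above from raw HTML
# with BeautifulSoup (the HTML snippet here is made up).
from bs4 import BeautifulSoup

html = '<div id="ad-banner" class="promo"><a href="http://example.test">Buy now!</a></div>'
soup = BeautifulSoup(html, "html.parser")

tag_names = [tag.name for tag in soup.find_all(True)]                      # tag names
ids = [tag["id"] for tag in soup.find_all(id=True)]                        # id attributes
classes = [c for tag in soup.find_all(class_=True) for c in tag["class"]]  # class attributes
inner_text = soup.get_text()                                               # inner content
print(tag_names, ids, classes, inner_text)
```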
Feature Extraction: Textual Data
Counts of symbols and spaces inside text can also be powerful features!
Using only 7 simple count features gives you over 0.90 AUC!
Reference: https://www.kaggle.com/c/dato-native/forums/t/16626/beat-the-benchmark-0-90388-with-simple-model
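A minimal sketch of such count features; these five are illustrative, not the exact 7 used in the linked benchmark:

```python
# Minimal sketch: simple symbol/space count features over raw text.
import re

def count_features(text):
    return {
        "n_chars":  len(text),
        "n_spaces": text.count(" "),
        "n_digits": sum(ch.isdigit() for ch in text),
        "n_upper":  sum(ch.isupper() for ch in text),
        "n_punct":  len(re.findall(r"[^\w\s]", text)),
    }

print(count_features("Buy NOW!!! Only $9.99..."))
```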
Feature Extraction: Hidden Features
Exploratory Data Analysis is Your Friend
Finding patterns and correlations from time series data
Reference: https://www.kaggle.com/darraghdog/springleaf-marketing-response/explore-springleaf/notebook
Exploratory Data Analysis is Your Friend
Using 2D visualization to observe the distribution of data in problem space
Reference: https://www.kaggle.com/benhamner/otto-group-product-classification-challenge/t-sne-visualization
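A minimal sketch of a 2D t-SNE embedding with scikit-learn (on the digits dataset, not the Otto data from the linked notebook):

```python
# Minimal sketch: 2-D t-SNE visualization of a labeled dataset.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
emb = TSNE(n_components=2, random_state=0).fit_transform(X)

plt.scatter(emb[:, 0], emb[:, 1], c=y, s=5, cmap="tab10")  # one color per class
plt.show()
```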
Exploratory Data Analysis is Your Friend
Finding hidden groups inside the data
Reference: https://www.kaggle.com/markpeng/liberty-mutual-group-property-inspection-prediction/hazard-groups-in-training-set/files
Feature Selection
Feature Selection Methods

Type                       | Name                                 | R                                 | Python
Feature Importance Ranking | Gini Impurity                        | randomForest, varSelRF            | sklearn.ensemble.RandomForestClassifier, sklearn.ensemble.RandomForestRegressor, sklearn.ensemble.GradientBoostingClassifier, sklearn.ensemble.GradientBoostingRegressor
Feature Importance Ranking | Chi-square                           | FSelector                         | sklearn.feature_selection.chi2
Feature Importance Ranking | Correlation                          | Hmisc, FSelector                  | scipy.stats.pearsonr, scipy.stats.spearmanr
Feature Importance Ranking | Information Gain                     | randomForest, varSelRF, FSelector | sklearn.ensemble.RandomForestClassifier, sklearn.ensemble.RandomForestRegressor, sklearn.ensemble.GradientBoostingClassifier, sklearn.ensemble.GradientBoostingRegressor, xgboost
Feature Importance Ranking | L1-based Non-zero Coefficients       | glmnet                            | sklearn.linear_model.Lasso, sklearn.linear_model.LogisticRegression, sklearn.svm.LinearSVC
Feature Subset Selection   | Recursive Feature Elimination (RFE)  | rfe {caret}                       | sklearn.feature_selection.RFE
Feature Subset Selection   | Boruta Feature Selection             | Boruta                            |
Feature Subset Selection   | Greedy Search (forward/backward)     | FSelector                         |
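A minimal sketch of one method from the table, feature importance ranking via random forest Gini importance, on synthetic data:

```python
# Minimal sketch: rank features by random forest Gini importance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

ranking = np.argsort(rf.feature_importances_)[::-1]
print(ranking)  # feature indices, most important first
```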
Feature Selection Methods (cont.)

Type                       | Name          | Spark
Feature Importance Ranking | Gini Impurity | RandomForest (see SPARK-5133)
Feature Importance Ranking | Chi-square    | ChiSqSelector
Feature Importance Ranking | Correlation   |
Feature Importance (xgboost)
Reference: https://www.kaggle.com/tqchen/otto-group-product-classification-challenge/understanding-xgboost-model-on-otto-data/notebook
Feature Importance (randomForest in R)
[Figure: variable importance plot; the lowest-ranked features are likely removable]
Feature Correlations
Correlations between top-20 features
Reference: https://www.kaggle.com/benhamner/otto-group-product-classification-challenge/important-feature-correlations
Recursive Feature Elimination (RFE)
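A minimal sketch of RFE with scikit-learn on synthetic data:

```python
# Minimal sketch: recursive feature elimination with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=4).fit(X, y)

print(rfe.support_)  # boolean mask of kept features
print(rfe.ranking_)  # 1 = kept; larger numbers were eliminated earlier
```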
Recommended Data Science Process (IMHO)
[Figure: process timeline showing effort shifting from Exploratory Data Analysis to Final Predictions over relative time]
Ensemble Learning - Illustration
Less variance
Reference: http://www.scholarpedia.org/article/Ensemble_learning
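A minimal numeric illustration of the "less variance" point, assuming six models with independent errors:

```python
# Minimal sketch: averaging several noisy, decorrelated predictors is more
# stable than any single one (the independence assumption is idealized).
import numpy as np

rng = np.random.default_rng(0)
truth = 1.0
preds = truth + rng.normal(0, 0.1, size=(6, 1000))  # 6 independent noisy models

print(preds[0].var())            # variance of a single model
print(preds.mean(axis=0).var())  # variance of the blend (~1/6 as large)
```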
Ensemble Learning - Stacked Generalization
Reference: http://www.scholarpedia.org/article/Ensemble_learning
Out-of-fold Prediction (K = 2)
[Figure: with K = 2, each model is trained on fold 1 to predict fold 2 and vice versa; the out-of-fold predictions from 6 models form the meta features for training the meta model, and each model's predictions on the testing data form the test-side meta features]
Out-of-fold Prediction - Generation Process
Reference: http://mlwave.com/kaggle-ensembling-guide/
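A minimal sketch of generating OOF meta features for one base model with K = 2, mirroring the figure above; the data is synthetic:

```python
# Minimal sketch: out-of-fold (OOF) meta-feature generation for one base
# model with K = 2; repeat per base model, then train the meta model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_test = X[:100]  # stand-in for the real testing data

oof = np.zeros(len(X))             # out-of-fold predictions (train meta feature)
test_pred = np.zeros(len(X_test))  # averaged test predictions (test meta feature)

kf = KFold(n_splits=2, shuffle=True, random_state=0)
for train_idx, valid_idx in kf.split(X):
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    oof[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]
    test_pred += model.predict_proba(X_test)[:, 1] / kf.get_n_splits()

# Stack the OOF columns from all base models as inputs to the meta model.
```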
Out-of-fold Prediction (cont.)
Reference: https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/14335/1st-place-winner-solution-gilberto-titericz-stanislav-semenov
Stacked Generalization Used by the Winning Team in the Truly Native? Competition (4 levels)
Reference: http://blog.kaggle.com/2015/12/03/dato-winners-interview-1st-place-mad-professors/
Stacked Generalization I Used in the Search Results Relevance Competition
[Figure: stacking architectures with final ranks labeled (Rank=15, Rank=50)]
Reference: Booz Allen Hamilton. (2015). Field Guide to Data Science, page 93.
Ensemble Learning - Tips
Team Up
Working with clever data geeks worldwide
The Advantages of Teaming Up
- Less workload for each member if the work is divided up well
  - A focuses on feature engineering
  - B focuses on ensemble learning
  - C focuses on blending submissions
- Each member can contribute single models for blending and stacking
Ways to Team Up
When to Start a Team Merger?
How to Produce Better Predictions
Recommended Resources
Materials to help you learn by doing
Books
Massive Open Online Courses (MOOCs)
Other Resources
Believe in Yourself
You can make it to the top 10 alone as well!
Concluding Words
Thanks!
Contact me:
Mark Peng
markpeng.ntu@gmail.com
LinkedIn Profile