
Copyright 2010-2016, Dr. Saed Sayad

An Introduction to Data Mining

Map > Data Mining

Data Mining
Data Mining is about explaining the past and predicting the future by means of data analysis. Data mining is a multi-
disciplinary field which combines statistics, machine learning, artificial intelligence and database technology. The value
of data mining applications is often estimated to be very high. Many businesses have stored large amounts of data over
years of operation, and data mining is able to extract very valuable knowledge from this data. The businesses are then
able to leverage the extracted knowledge into more clients, more sales, and greater profits. This is also true in the
engineering and medical fields.

Statistics
The science of collecting, classifying, summarizing, organizing, analyzing, and interpreting data.

Artificial Intelligence
The study of computer algorithms dealing with the simulation of intelligent behaviors in order to perform those activities
that are normally thought to require intelligence.

Machine Learning
The study of computer algorithms that improve automatically through experience.

Database
The science and technology of collecting, storing and managing data so users can retrieve, add, update or remove such
data.

Data warehousing
The science and technology of collecting, storing and managing data with advanced multi-dimensional reporting services
in support of the decision making processes.

Map > Data Mining > Explaining the Past
Explaining the Past
Data mining explains the past through data exploration.

Map > Data Mining > Explaining the Past > Data Exploration

Data Exploration
Data Exploration is about describing the data by means of statistical and visualization techniques. We explore data in
order to bring important aspects of that data into focus for further analysis.

1. Univariate Analysis

2. Bivariate Analysis

Map > Data Mining > Explaining the Past > Data Exploration > Univariate Analysis

Univariate Analysis
Univariate analysis explores variables (attributes) one by one. Variables could be either categorical or numerical. There
are different statistical and visualization techniques of investigation for each type of variable. Numerical variables can be
transformed into categorical counterparts by a process called binning or discretization. It is also possible to transform a
categorical variable into its numerical counterpart by a process called encoding. Finally, proper handling of missing values
is an important issue in mining data.

1. Categorical Variables
2. Numerical Variables

Map > Data Mining > Explaining the Past > Data Exploration > Univariate Analysis > Categorical Variables

Categorical Variables
A categorical or discrete variable is one that has two or more categories (values). There are two types of categorical variables: nominal and ordinal. A nominal variable has no intrinsic ordering to its categories. For example, gender is a categorical variable with two categories (male and female) and no intrinsic ordering to the categories. An ordinal variable has a clear ordering; for example, temperature is an ordinal variable with three ordered categories (low, medium and high). A frequency table is a way of counting how often each category of the variable in question occurs. It may be
enhanced by the addition of percentages that fall into each category.

Univariate Analysis - Categorical


Statistics    Visualization    Description
Count         Bar Chart        The number of values of the specified variable.
Count%        Pie Chart        The percentage of values of the specified variable.

Example: The housing variable with three categories (for free, own and rent).

Exercise

Map > Data Mining > Explaining the Past > Data Exploration > Univariate Analysis > Numerical Variables

Numerical Variables
A numerical or continuous variable (attribute) is one that may take on any value within a finite or infinite interval (e.g.,
height, weight, temperature, blood glucose, ...). There are two types of numerical variables, interval and ratio. An interval
variable has values whose differences are interpretable, but it does not have a true zero. A good example is temperature
in Centigrade degrees. Data on an interval scale can be added and subtracted but cannot be meaningfully multiplied or
divided. For example, we cannot say that one day is twice as hot as another day. In contrast, a ratio variable has values
with a true zero and can be added, subtracted, multiplied or divided (e.g., weight).

Univariate Analysis - Numerical


Statistics                 Visualization    Equation     Description
Count                      Histogram        N            The number of values (observations) of the variable.
Minimum                    Box Plot         Min          The smallest value of the variable.
Maximum                    Box Plot         Max          The largest value of the variable.
Mean                       Box Plot                      The sum of the values divided by the count.
Median                     Box Plot                      The middle value; an equal number of values lie below and above it.
Mode                       Histogram                     The most frequent value. There can be more than one mode.
Quantile                   Box Plot                      A set of 'cut points' that divide the data into groups containing equal numbers of values (Quartile, Quintile, Percentile, ...).
Range                      Box Plot         Max - Min    The difference between the maximum and the minimum.
Variance                   Histogram                     A measure of data dispersion.
Standard Deviation         Histogram                     The square root of the variance.
Coefficient of Variation   Histogram                     The standard deviation divided by the mean.
Skewness                   Histogram                     A measure of symmetry or asymmetry in the distribution of the data.
Kurtosis                   Histogram                     A measure of whether the data are peaked or flat relative to a normal distribution.

Box plot and histogram for the "sepal length" variable from the Iris dataset.

Example:
Statistical analysis using Microsoft Excel (Iris.xls)
sepal length
Count 150
Minimum 4.3
Maximum 7.9
Mean 5.84
Median 5.8
Mode 5
Quartile 1 5.1
Range 3.6
Variance 0.69
Standard Deviation 0.83
Coefficient of Variation 14.2%
Skewness 0.31
Kurtosis -0.55
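
These summary statistics can also be computed outside Excel. The sketch below is illustrative and assumes that scikit-learn's bundled copy of the Iris data matches the Iris.xls file used above (sepal length is the first column).

```python
# Illustrative sketch: univariate statistics for "sepal length" with NumPy/SciPy,
# assuming scikit-learn's bundled Iris data matches the Iris.xls used above.
import numpy as np
from scipy import stats
from sklearn.datasets import load_iris

x = load_iris().data[:, 0]                     # sepal length (cm)
values, counts = np.unique(x, return_counts=True)

print("Count                   ", x.size)
print("Minimum                 ", x.min())
print("Maximum                 ", x.max())
print("Mean                    ", round(x.mean(), 2))
print("Median                  ", np.median(x))
print("Mode                    ", values[counts.argmax()])
print("Quartile 1              ", np.percentile(x, 25))
print("Range                   ", round(x.max() - x.min(), 2))
print("Variance                ", round(x.var(ddof=1), 2))      # sample variance, as in Excel
print("Standard Deviation      ", round(x.std(ddof=1), 2))
print("Coefficient of Variation", f"{x.std(ddof=1) / x.mean():.1%}")
print("Skewness                ", round(stats.skew(x), 2))
print("Kurtosis                ", round(stats.kurtosis(x), 2))
```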
Exercise

Map > Data Mining > Explaining the Past > Data Exploration > Bivariate Analysis

Bivariate Analysis
Bivariate analysis is the simultaneous analysis of two variables (attributes). It explores the concept of relationship
between two variables, whether there exists an association and the strength of this association, or whether there are
differences between two variables and the significance of these differences. There are three types of bivariate analysis.

1. Numerical & Numerical


2. Categorical & Categorical
3. Numerical & Categorical

Map > Data Mining > Explaining the Past > Data Exploration > Bivariate Analysis > Categorical & Categorical

Bivariate Analysis - Categorical & Categorical

Stacked Column Chart


A stacked column chart is a useful graph to visualize the relationship between two categorical variables. It compares the percentage that each category from one variable contributes to a total across categories of the second variable.

Combination Chart
A combination chart uses two or more chart types to emphasize that the chart contains different kinds of information. Here, we use a bar chart to show the distribution of one categorical variable and a line chart to show the percentage of the selected category from the second categorical variable. The combination chart is the best visualization method to demonstrate the predictive power of a predictor (X-axis) against a target (Y-axis).

Chi-square Test
The chi-square test can be used to determine the association between categorical variables. It is based on the difference
between the expected frequencies (e) and the observed frequencies (n) in one or more categories in the frequency table.
The chi-square distribution returns a probability for the computed chi-square and the degrees of freedom. A probability close to zero indicates a strong dependency between the two categorical variables, while a probability close to one means the two categorical variables are essentially independent. The Tchouproff contingency coefficient measures the amount of dependency between two categorical variables.

Example:
The following frequency table (contingency table), with a chi-square of 10.67, degrees of freedom (df) of 2 and a probability of 0.005, shows a significant dependency between two categorical variables (hair and eye colors).
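
The chi-square test of independence can be run with SciPy. The contingency table in the original example is an image, so the hair/eye colour counts below are hypothetical placeholders; the function call itself is the point of the sketch.

```python
# Illustrative sketch: chi-square test of independence with SciPy.
# The 2x3 hair-colour / eye-colour counts below are hypothetical.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[20, 15, 5],     # e.g., eye colour = light
                     [10, 25, 25]])   # e.g., eye colour = dark

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, df = {dof}, probability = {p:.4f}")
```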

Exercise

Map > Data Mining > Explaining the Past > Data Exploration > Bivariate Analysis > Numerical & Numerical

Bivariate Analysis - Numerical & Numerical

Scatter Plot
A scatter plot is a useful visual representation of the relationship between two numerical variables (attributes) and is
usually drawn before working out a linear correlation or fitting a regression line. The resulting pattern indicates the type
(linear or non-linear) and strength of the relationship between two variables. More information can be added to a two-
dimensional scatter plot, for example, we might label points with a code to indicate the level of a third variable. If we are
dealing with many variables in a data set, a way of presenting all possible scatter plots of two variables at a time is in a
scatter plot matrix.

Exercise

Linear Correlation

Linear correlation quantifies the strength of a linear relationship between two numerical variables. When there is no
correlation between two variables, there is no tendency for the values of one quantity to increase or decrease with the
values of the second quantity.

The correlation coefficient r only measures the strength of a linear relationship and is always between -1 and +1, where -1 means perfect negative linear correlation, +1 means perfect positive linear correlation, and zero means no linear correlation.

Example:
Temperature 83 64 72 81 70 68 65 75 71 85 80 72 69 75
Humidity 86 65 90 75 96 80 70 80 91 85 90 95 70 70

              Variance    Covariance    Correlation
Temperature     40.10        19.78         0.32
Humidity        98.23
There is a weak linear correlation between Temperature and Humidity.
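
A minimal sketch for reproducing these numbers with NumPy, using the Temperature and Humidity values listed above and population formulas (dividing by N), which is what the table appears to use:

```python
# Illustrative sketch: variance, covariance and linear correlation (population formulas).
import numpy as np

temperature = np.array([83, 64, 72, 81, 70, 68, 65, 75, 71, 85, 80, 72, 69, 75])
humidity    = np.array([86, 65, 90, 75, 96, 80, 70, 80, 91, 85, 90, 95, 70, 70])

print("Variance (Temperature):", round(np.var(temperature), 2))                          # ~40.10
print("Variance (Humidity):   ", round(np.var(humidity), 2))                             # ~98.23
print("Covariance:            ", round(np.cov(temperature, humidity, ddof=0)[0, 1], 2))  # ~19.78
print("Correlation r:         ", round(np.corrcoef(temperature, humidity)[0, 1], 2))     # ~0.32
```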

Map > Data Mining > Explaining the Past > Data Exploration > Bivariate Analysis > Categorical & Numerical

Bivariate Analysis - Categorical & Numerical

Line Chart with Error Bars


A line chart with error bars displays information as a series of data points connected by straight line segments. Each data point is the average of the numerical data for the corresponding category of the categorical variable, with an error bar showing the standard error. It is a way to summarize how pieces of information are related and how they vary depending on one
another (iris_linechart.xlsx).

Combination Chart
A combination chart uses two or more chart types to emphasize that the chart contains different kinds of information.
Here, we use a bar chart to show the distribution of a binned numerical variable and a line chart to show the percentage
of the selected category from the categorical variable. The combination chart is the best visualization method to demonstrate the predictive power of a predictor (X-axis) against a target (Y-axis).

Z-test and t-test


Z-test and t-test are basically the same. They assess whether the averages of two groups are statistically different from
each other. This analysis is appropriate for comparing the averages of a numerical variable for two categories of a
categorical variable.

If the probability of Z is small, the difference between two averages is more significant.

t-test
When n1 or n2 is less than 30, we use the t-test instead of the Z-test.

Example:
Is there a significant difference between the means (averages) of the numerical variable (Temperature) in two different
categories of the categorical variable (O-Ring Failure)?

O-Ring Failure Temperature


Y 53 56 57 70 70 70 75
N 63 66 67 67 67 68 69 70 72 73 75 76 76 78 79 80 81

t-test: Temperature by O-Ring Failure

                     Y        N
Count                7        17
Mean               64.43    72.18
Variance           76.95    30.78

t = -2.62,  df = 22,  Probability = 0.0156

The low probability (0.0156) means that the difference between the average temperature for failed O-Ring and the
average temperature for intact O-Ring is significant.
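
A minimal sketch of the same test with SciPy, using the temperatures listed above and pooled (equal) variances:

```python
# Illustrative sketch: two-sample t-test for the O-Ring example with SciPy.
from scipy import stats

temp_failure = [53, 56, 57, 70, 70, 70, 75]
temp_intact  = [63, 66, 67, 67, 67, 68, 69, 70, 72, 73, 75, 76, 76, 78, 79, 80, 81]

t, p = stats.ttest_ind(temp_failure, temp_intact, equal_var=True)
print(f"t = {t:.2f}, p = {p:.4f}")   # approximately t = -2.62, p = 0.0156
```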

Analysis of Variance (ANOVA)


The ANOVA test assesses whether the averages of more than two groups are statistically different from each other. This
analysis is appropriate for comparing the averages of a numerical variable for more than two categories of a categorical
variable.

Example:
Is there a significant difference between the averages of the numerical variable (Humidity) in the three categories of the
categorical variable (Outlook)?

Outlook Humidity
overcast 86 65 90 75
rainy 96 80 70 80 91
sunny 85 90 95 70 70

Outlook Count Mean Variance


overcast 4 75 57.5
rainy 5 69.8 10.96
sunny 5 76.2 32.56

Source of Variation Sum of Squares Degree of freedom Mean Square F Value Probability
Between Groups 113.83 2 56.91 1.3987 0.288
Within Groups 447.60 11 40.69
Total 561.43 13

There is no significant difference between the averages of Humidity in the three categories of Outlook.
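
A one-way ANOVA like the one above can be run with SciPy. The group values below are hypothetical humidity samples grouped by Outlook; the call itself is the point of the sketch.

```python
# Illustrative sketch: one-way ANOVA with SciPy on three hypothetical humidity samples.
from scipy import stats

humidity_overcast = [86, 78, 75, 80]
humidity_rainy    = [96, 80, 70, 80, 91]
humidity_sunny    = [85, 90, 95, 70, 70]

f_value, p_value = stats.f_oneway(humidity_overcast, humidity_rainy, humidity_sunny)
print(f"F = {f_value:.4f}, probability = {p_value:.3f}")
```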

Map > Data Mining > Predicting the Future

Predicting the Future


Data mining predicts the future by means of modeling.

Map > Data Mining > Predicting the Future > Modeling

Modeling
Predictive modeling is the process by which a model is created to predict an outcome. If the outcome is categorical it is
called classification and if the outcome is numerical it is called regression. Descriptive modeling or clustering is the
assignment of observations into clusters so that observations in the same cluster are similar. Finally, association rules
can find interesting associations amongst observations.

Map > Data Mining > Predicting the Future > Modeling > Classification

Classification
Classification is a data mining task of predicting the value of a categorical variable (target or class) by building a model
based on one or more numerical and/or categorical variables (predictors or attributes).

Four main groups of classification algorithms are:


1. Frequency Table
ZeroR
OneR
Naive Bayesian
Decision Tree
2. Covariance Matrix
Linear Discriminant Analysis
Logistic Regression
3. Similarity Functions
K Nearest Neighbors
4. Others
Artificial Neural Network
Support Vector Machine

Map > Data Mining > Predicting the Future > Modeling > Classification > ZeroR

ZeroR
ZeroR is the simplest classification method; it relies on the target and ignores all predictors. The ZeroR classifier simply predicts the majority category (class). Although there is no predictive power in ZeroR, it is useful for determining a baseline performance as a benchmark for other classification methods.

Algorithm
Construct a frequency table for the target and select its most frequent value.

Example:
"Play Golf = Yes" is the ZeroR model for the following dataset with an accuracy of 0.64.

Predictors Contribution
There is nothing to be said about the predictors contribution to the model because ZeroR does not use any of them.

Model Evaluation
The following confusion matrix shows that ZeroR only predicts the majority class correctly. As mentioned before, ZeroR is
only useful for determining a baseline performance for other classification methods.

Confusion Matrix            Play Golf (actual)
                            Yes    No
ZeroR predicted      Yes     9      5     Positive Predictive Value = 0.64
                     No      0      0     Negative Predictive Value = 0.00
                    Sensitivity = 1.00    Specificity = 0.00    Accuracy = 0.64
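
Because ZeroR is just the majority class of the target, it can be sketched in a few lines of Python (illustrative; the Play Golf target below is assumed to have 9 Yes and 5 No, as in the example above):

```python
# Illustrative ZeroR sketch: predict the majority class of the target, ignoring all predictors.
from collections import Counter

def zero_r(target_values):
    """Return the majority class and its relative frequency (the baseline accuracy)."""
    majority, count = Counter(target_values).most_common(1)[0]
    return majority, count / len(target_values)

play_golf = ["Yes"] * 9 + ["No"] * 5      # assumed target values
print(zero_r(play_golf))                  # ('Yes', 0.64...)
```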

Exercise

ZeroR Interactive

Map > Data Mining > Predicting the Future > Modeling > Classification > OneR

OneR
OneR, short for "One Rule", is a simple, yet accurate, classification algorithm that generates one rule for each predictor in
the data, then selects the rule with the smallest total error as its "one rule". To create a rule for a predictor, we construct
a frequency table for each predictor against the target. It has been shown that OneR produces rules only slightly less
accurate than state-of-the-art classification algorithms while producing rules that are simple for humans to interpret.

OneR Algorithm
For each predictor,
For each value of that predictor, make a rule as follows;
Count how often each value of target (class) appears
Find the most frequent class
Make the rule assign that class to this value of the predictor
Calculate the total error of the rules of each predictor
Choose the predictor with the smallest total error.
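
The pseudocode above can be sketched directly in Python (illustrative; the rows and column names below are hypothetical):

```python
# Illustrative OneR sketch: build one rule per predictor from its frequency table
# against the target, then keep the predictor with the smallest total error.
from collections import Counter, defaultdict

def one_r(rows, predictors, target):
    best = None
    for p in predictors:
        table = defaultdict(Counter)                 # predictor value -> target class counts
        for row in rows:
            table[row[p]][row[target]] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in table.items()}
        errors = sum(sum(c.values()) - c.most_common(1)[0][1] for c in table.values())
        if best is None or errors < best[2]:
            best = (p, rule, errors)
    return best                                      # (predictor, rule, total error)

rows = [  # hypothetical weather records
    {"Outlook": "Sunny", "Windy": "False", "Play": "No"},
    {"Outlook": "Sunny", "Windy": "True",  "Play": "No"},
    {"Outlook": "Overcast", "Windy": "False", "Play": "Yes"},
    {"Outlook": "Rainy", "Windy": "False", "Play": "Yes"},
    {"Outlook": "Rainy", "Windy": "True",  "Play": "No"},
]
print(one_r(rows, ["Outlook", "Windy"], "Play"))
```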

Example:
Finding the best predictor with the smallest total error using the OneR algorithm, based on the related frequency tables.

The best predictor is:

Predictors Contribution
Simply put, the total error calculated from the frequency tables is the measure of each predictor's contribution. A low total error means a higher contribution to the predictive power of the model.

Model Evaluation
The following confusion matrix shows significant predictive power. OneR does not generate a score or probability, which means evaluation charts (Gain, Lift, K-S and ROC) are not applicable.

Confusion Matrix            Play Golf (actual)
                            Yes    No
OneR predicted       Yes     7      2     Positive Predictive Value = 0.78
                     No      2      3     Negative Predictive Value = 0.60
                    Sensitivity = 0.78    Specificity = 0.60    Accuracy = 0.71

Exercise

Try to invent a new OneR algorithm by using ANOVA and Chi2 test.

OneR Interactive

Map > Data Mining > Predicting the Future > Modeling > Classification > Naive Bayesian

Naive Bayesian
The Naive Bayesian classifier is based on Bayes theorem with independence assumptions between predictors. A Naive
Bayesian model is easy to build, with no complicated iterative parameter estimation which makes it particularly useful for
very large datasets. Despite its simplicity, the Naive Bayesian classifier often does surprisingly well and is widely used
because it often outperforms more sophisticated classification methods.

Algorithm
Bayes' theorem provides a way of calculating the posterior probability, P(c|x), from P(c), P(x), and P(x|c). The Naive Bayes classifier assumes that the effect of the value of a predictor (x) on a given class (c) is independent of the values of the other predictors. This assumption is called class conditional independence.

P(c|x) is the posterior probability of class (target) given predictor (attribute).


P(c) is the prior probability of class.
P(x|c) is the likelihood which is the probability of predictor given class.
P(x) is the prior probability of predictor.
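
In standard notation, Bayes' theorem and the naive independence assumption described above can be written as:

```latex
P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}
\qquad\qquad
P(c \mid x_1, \dots, x_n) \propto P(c) \prod_{i=1}^{n} P(x_i \mid c)
```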

Example:
The posterior probability can be calculated by first constructing a frequency table for each attribute against the target, then transforming the frequency tables into likelihood tables, and finally using the Naive Bayesian equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction.

The zero-frequency problem


Add 1 to the count for every attribute value-class combination (Laplace estimator) when an attribute value (e.g., Outlook=Overcast) doesn't occur with every class value (e.g., Play Golf=no).

Numerical Predictors
Numerical variables need to be transformed to their categorical counterparts (binning) before constructing their frequency
tables. The other option we have is using the distribution of the numerical variable to have a good guess of the frequency.
For example, one common practice is to assume normal distributions for numerical variables.

The probability density function for the normal distribution is defined by two parameters (mean and standard deviation).
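
In standard notation, that probability density function is:

```latex
f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```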

Example:
                   Humidity                          Mean    StDev
Play Golf = yes    86 96 80 65 70 80 70 90 75        79.1    10.2
Play Golf = no     85 90 70 95 91                    86.2     9.7

Predictors Contribution
Kononenko's information gain, computed as a sum of the information contributed by each attribute, can offer an explanation of how the values of the predictors influence the class probability.

The contribution of predictors can also be visualized by plotting nomograms. A nomogram plots the log odds ratios for each value of each predictor. The lengths of the lines correspond to the spans of the odds ratios, suggesting the importance of the related predictor. It also shows the impact of individual values of the predictor.

Exercise

Try to invent a real time Bayesian classifier. You should be able to add or remove data and variables (predictors and
classes) on the fly.

Naive Bayesian Interactive

Map > Data Mining > Predicting the Future > Modeling > Classification > Decision Tree

Decision Tree - Classification


A decision tree builds classification or regression models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node (e.g., Outlook) has two or more branches (e.g., Sunny, Overcast and Rainy). A leaf node (e.g., Play) represents a classification or decision. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.

Algorithm

The core algorithm for building decision trees, called ID3 and developed by J. R. Quinlan, employs a top-down, greedy search through the space of possible branches with no backtracking. ID3 uses Entropy and Information Gain to construct a decision tree.

Entropy
A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). The ID3 algorithm uses entropy to calculate the homogeneity of a sample. If the sample is completely homogeneous the entropy is zero, and if the sample is equally divided it has an entropy of one.

To build a decision tree, we need to calculate two types of entropy using frequency tables as follows:

a) Entropy using the frequency table of one attribute:

b) Entropy using the frequency table of two attributes:
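
The two entropy formulas referred to above are usually written as follows, where p_i is the proportion of class i in the sample S, and E(T, X) is the weighted average of the branch entropies when the target T is split on attribute X:

```latex
E(S) = \sum_{i=1}^{c} -p_i \log_2 p_i
\qquad\qquad
E(T, X) = \sum_{v \in X} P(v)\, E(S_v)
```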

Information Gain
The information gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).

Step 1: Calculate entropy of the target.

Step 2: The dataset is then split on the different attributes. The entropy for each branch is calculated. Then it is added
proportionally, to get total entropy for the split. The resulting entropy is subtracted from the entropy before the split. The
result is the Information Gain, or decrease in entropy.

Step 3: Choose attribute with the largest information gain as the decision node.

Step 4a: A branch with entropy of 0 is a leaf node.

Step 4b: A branch with entropy more than 0 needs further splitting.

Step 5: The ID3 algorithm is run recursively on the non-leaf branches, until all data is classified.
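
A minimal sketch of the entropy and information-gain computation used in Steps 1-3 (illustrative; the class counts below follow the classic Play Golf example, with 9 Yes / 5 No overall and Outlook = 5 Sunny, 4 Overcast, 5 Rainy):

```python
# Illustrative sketch: entropy and information gain, Gain(T, X) = E(T) - E(T, X).
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    total_entropy = entropy([r[target] for r in rows])
    weighted = 0.0
    for value, n in Counter(r[attribute] for r in rows).items():
        branch = [r[target] for r in rows if r[attribute] == value]
        weighted += (n / len(rows)) * entropy(branch)
    return total_entropy - weighted

rows = (  # assumed Play Golf counts
    [{"Outlook": "Sunny", "Play": "Yes"}] * 2 + [{"Outlook": "Sunny", "Play": "No"}] * 3
    + [{"Outlook": "Overcast", "Play": "Yes"}] * 4
    + [{"Outlook": "Rainy", "Play": "Yes"}] * 3 + [{"Outlook": "Rainy", "Play": "No"}] * 2
)
print(round(information_gain(rows, "Outlook", "Play"), 3))   # ~0.247
```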

Decision Tree to Decision Rules


A decision tree can easily be transformed to a set of rules by mapping from the root node to the leaf nodes one by one.

Decision Trees - Issues
Working with continuous attributes (binning)
Avoiding overfitting
Super Attributes (attributes with many values)
Working with missing values

Exercise

Try to invent a new algorithm to construct a decision tree from data using Chi2 test.

Map > Data Mining > Predicting the Future > Modeling > Classification > Linear Discriminant Analysis

Linear Discriminant Analysis


Linear Discriminant Analysis (LDA) is a classification method originally developed in 1936 by R. A. Fisher. It is simple,
mathematically robust and often produces models whose accuracy is as good as more complex methods.

Algorithm
LDA is based upon the concept of searching for a linear combination of variables (predictors) that best separates two
classes (targets). To capture the notion of separability, Fisher defined the following score function.

Given the score function, the problem is to estimate the linear coefficients that maximize the score which can be solved
by the following equations.

One way of assessing the effectiveness of the discrimination is to calculate the Mahalanobis distance between the two groups. A distance greater than 3 means that the two averages differ by more than 3 standard deviations, so the overlap (probability of misclassification) is quite small.

Finally, a new point is classified by projecting it onto the maximally separating direction and classifying it as C1 if:

Example:
Suppose we received a dataset from a bank regarding its small business clients who defaulted (red square) and those that did not (blue circle), separated by delinquent days (DAYSDELQ) and number of months in business (BUSAGE). We use LDA
to find an optimal linear model that best separates two classes (default and non-default).

The first step is to calculate the mean (average) vectors, covariance matrices and class probabilities.

Then, we calculate pooled covariance matrix and finally the coefficients of the linear model.

A Mahalanobis distance of 2.32 shows a small overlap between two groups which means a good separation between
classes by the linear model.
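
A minimal sketch of the same idea with scikit-learn's LDA; the BUSAGE/DAYSDELQ values below are hypothetical, since the bank dataset shown above is an image:

```python
# Illustrative sketch: Linear Discriminant Analysis with scikit-learn on hypothetical data.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.array([[87, 12], [64, 6], [48, 30], [30, 42], [15, 55], [20, 60]])  # [BUSAGE, DAYSDELQ]
y = np.array([0, 0, 0, 1, 1, 1])                                           # 0 = non-default, 1 = default

lda = LinearDiscriminantAnalysis().fit(X, y)
print("coefficients:", lda.coef_, "intercept:", lda.intercept_)
print("prediction for BUSAGE=40, DAYSDELQ=20:", lda.predict([[40, 20]]))
```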

Predictors Contribution
A simple linear correlation between the model scores and predictors can be used to test which predictors contribute
significantly to the discriminant function. Correlation varies from -1 to 1, with -1 and 1 meaning the highest contribution
but in different directions and 0 means no contribution at all.

Quadratic Discriminant Analysis (QDA)


QDA is a general discriminant function with quadratic decision boundaries which can be used to classify datasets with two or more classes. QDA has more predictive power than LDA, but it needs to estimate a covariance matrix for each class.

where Ck is the covariance matrix for the class k (-1 means inverse matrix), |Ck | is the determinant of the covariance
matrix Ck , and P(ck ) is the prior probability of the class k. The classification rule is simply to find the class with highest Z
value.

Exercise

Try to invent a real time LDA classifier. You should be able to add or remove data and variables (predictors and classes) on the fly.

LDA Interactive

Map > Data Mining > Predicting the Future > Modeling > Classification > Logistic Regression

Logistic Regression
Logistic regression predicts the probability of an outcome that can only have two values (i.e. a dichotomy). The prediction
is based on the use of one or several predictors (numerical and categorical). A linear regression is not appropriate for
predicting the value of a binary variable for two reasons:

A linear regression will predict values outside the acceptable range (e.g. predicting probabilities
outside the range 0 to 1)
Since the dichotomous experiments can only have one of two possible values for each experiment, the residuals will
not be normally distributed about the predicted line.

On the other hand, a logistic regression produces a logistic curve, which is limited to values between 0 and 1. Logistic
regression is similar to a linear regression, but the curve is constructed using the natural logarithm of the odds of the
target variable, rather than the probability. Moreover, the predictors do not have to be normally distributed or have equal
variance in each group.

In the logistic regression the constant (b0) moves the curve left and right and the slope (b1) defines the steepness of the
curve. By simple transformation, the logistic regression equation can be written in terms of an odds ratio.

Finally, taking the natural log of both sides, we can write the equation in terms of log-odds (logit) which is a linear
function of the predictors. The coefficient (b1) is the amount the logit (log-odds) changes with a one unit change in x.
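
In standard notation, the three equivalent forms described above (probability, odds and logit) are:

```latex
p = \frac{1}{1 + e^{-(b_0 + b_1 x)}}
\qquad
\frac{p}{1 - p} = e^{\,b_0 + b_1 x}
\qquad
\ln\!\left( \frac{p}{1 - p} \right) = b_0 + b_1 x
```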

As mentioned before, logistic regression can handle any number of numerical and/or categorical variables.

There are several analogies between linear regression and logistic regression. Just as ordinary least square regression is
the method used to estimate coefficients for the best fit line in linear regression, logistic regression uses maximum
likelihood estimation (MLE) to obtain the model coefficients that relate predictors to the target. After this initial function
is estimated, the process is repeated until LL (Log Likelihood) does not change significantly.

A pseudo R2 value is also available to indicate the adequacy of the regression model. The likelihood ratio test tests the significance of the difference between the likelihood for the baseline model and the likelihood for a reduced model; this difference is called the "model chi-square". The Wald test is used to test the statistical significance of each coefficient (b) in the model (i.e., each predictor's contribution).

Pseudo R2
There are several measures intended to mimic the R2 analysis to evaluate the goodness-of-fit of logistic models, but they cannot be interpreted as one would interpret an R2, and different pseudo R2 measures can arrive at very different values. Here we discuss three pseudo R2 measures.

Pseudo R2      Description
Efron's        'p' is the logistic model predicted probability. The model residuals are squared, summed, and divided by the total variability in the dependent variable.
McFadden's     The ratio of the log-likelihoods suggests the level of improvement over the intercept model offered by the full model.
Count          The number of records correctly predicted, given a cutoff point of 0.5, divided by the total count of cases. This is equal to the accuracy of a classification model.
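
The usual definitions of these three measures are (here p-hat is the predicted probability, y-bar the mean of the target, and L the likelihood):

```latex
R^2_{\text{Efron}} = 1 - \frac{\sum_i (y_i - \hat{p}_i)^2}{\sum_i (y_i - \bar{y})^2}
\qquad
R^2_{\text{McFadden}} = 1 - \frac{\ln \hat{L}_{\text{full}}}{\ln \hat{L}_{\text{intercept}}}
\qquad
R^2_{\text{Count}} = \frac{\text{correct predictions (cutoff 0.5)}}{\text{total cases}}
```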

Likelihood Ratio Test


The likelihood ratio test provides the means for comparing the likelihood of the data under one model (e.g., full model)
against the likelihood of the data under another, more restricted model (e.g., intercept model).

where 'p' is the logistic model predicted probability. The next step is to calculate the difference between these two log-
likelihoods.

The difference between two likelihoods is multiplied by a factor of 2 in order to be assessed for statistical significance
using standard significance levels (Chi2 test). The degrees of freedom for the test will equal the difference in the number
of parameters being estimated under the models (e.g., full and intercept).

Wald test
A Wald test is used to evaluate the statistical significance of each coefficient (b) in the model.

where W is the Wald's statistic with a normal distribution (like Z-test), b is the coefficient and SE is its standard error. The
W value is then squared, yielding a Wald statistic with a chi-square distribution.

Predictors Contributions
The Wald test is usually used to assess the significance of prediction of each predictor. Another indicator of contribution
of a predictor is exp(b) or odds-ratio of coefficient which is the amount the logit (log-odds) changes, with a one unit
change in the predictor (x).

Exercise

Logistic Regression Interactive

Map > Data Mining > Predicting the Future > Modeling > Classification > K Nearest Neighbors

K Nearest Neighbors - Classification


K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity
measure (e.g., distance functions). KNN has been used in statistical estimation and pattern recognition as a non-parametric technique since the beginning of the 1970s.

Algorithm
A case is classified by a majority vote of its neighbors, with the case being assigned to the class most common amongst
its K nearest neighbors measured by a distance function. If K = 1, then the case is simply assigned to the class of its
nearest neighbor.

It should also be noted that all three distance measures are only valid for continuous variables. In the instance of
categorical variables the Hamming distance must be used. It also brings up the issue of standardization of the numerical
variables between 0 and 1 when there is a mixture of numerical and categorical variables in the dataset.

Choosing the optimal value for K is best done by first inspecting the data. In general, a large K value is more precise as it
reduces the overall noise but there is no guarantee. Cross-validation is another way to retrospectively determine a good K
value by using an independent dataset to validate the K value. Historically, the optimal K for most datasets has been
between 3-10. That produces much better results than 1NN.

Example:
Consider the following data concerning credit default. Age and Loan are two numerical variables (predictors) and Default
is the target.

We can now use the training set to classify an unknown case (Age=48 and Loan=$142,000) using Euclidean distance. If
K=1 then the nearest neighbor is the last case in the training set with Default=Y.

D = Sqrt[(48-33)^2 + (142000-150000)^2] = 8000.01 >> Default=Y

With K=3, there are two Default=Y and one Default=N out of three closest neighbors. The prediction for the unknown case
is again Default=Y.
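
A minimal sketch of this prediction with scikit-learn; the training rows below are hypothetical (the table in the original is an image), but they are chosen so that K=1 and K=3 both return Default=Y for the unknown case:

```python
# Illustrative KNN classification sketch with hypothetical Age/Loan training data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[25, 40000], [35, 60000], [45, 80000], [20, 20000],
                    [35, 120000], [60, 100000], [33, 150000]])   # [Age, Loan]
y_train = np.array(["N", "N", "N", "N", "N", "Y", "Y"])          # Default

unknown = np.array([[48, 142000]])                               # Age=48, Loan=$142,000

for k in (1, 3):
    knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean").fit(X_train, y_train)
    print(f"K={k}:", knn.predict(unknown)[0])
```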

Standardized Distance
One major drawback in calculating distance measures directly from the training set is in the case where variables have
different measurement scales or there is a mixture of numerical and categorical variables. For example, if one variable is
based on annual income in dollars, and the other is based on age in years then income will have a much higher influence
on the distance calculated. One solution is to standardize the training set as shown below.

Using the standardized distance on the same training set, the unknown case returned a different neighbor which is not a
good sign of robustness.

Exercise

Try to invent a new KNN algorithm using Linear Correlation.
KNN Interactive

Map > Data Mining > Predicting the Future > Modeling > Classification/Regression > Artificial Neural Network

Artificial Neural Network


An artificial neural network (ANN) is a system that is based on a biological neural network, such as the brain. The brain has approximately 100 billion neurons, which communicate through electro-chemical signals. The neurons are connected through junctions called synapses. Each neuron receives thousands of connections from other neurons, constantly receiving incoming signals that reach the cell body. If the resulting sum of the signals surpasses a certain threshold, a response is sent through the axon. The ANN attempts to recreate a computational mirror of the biological neural network, although it is not comparable, since the number and complexity of the neurons and connections used in a biological neural network are many times greater than those in an artificial neural network.

An ANN is comprised of a network of artificial neurons (also known as "nodes"). These nodes are connected to each other,
and the strength of their connections to one another is assigned a value based on their strength: inhibition (maximum
being -1.0) or excitation (maximum being +1.0). If the value of the connection is high, then it indicates that there is a
strong connection. Within each node's design, a transfer function is built in. There are three types of neurons in an ANN,
input nodes, hidden nodes, and output nodes.

The input nodes take in information, in the form which can be numerically expressed. The information is presented as
activation values, where each node is given a number, the higher the number, the greater the activation. This information
is then passed throughout the network. Based on the connection strengths (weights), inhibition or excitation, and transfer
functions, the activation value is passed from node to node. Each of the nodes sums the activation values it receives; it
then modifies the value based on its transfer function. The activation flows through the network, through hidden layers,
until it reaches the output nodes. The output nodes then reflect the input in a meaningful way to the outside world. The
difference between predicted value and actual value (error) will be propagated backward by apportioning them to each
node's weights according to the amount of this error the node is responsible for (e.g., gradient descent algorithm).

Transfer (Activation) Functions

The transfer function translates the input signals to output signals. Four types of transfer functions are commonly used,
Unit step (threshold), sigmoid, piecewise linear, and Gaussian.

Unit step (threshold)


The output is set at one of two levels, depending on whether the total input is greater than or less than some threshold
value.

Sigmoid
The sigmoid family consists of two functions, the logistic and the hyperbolic tangent. The values of the logistic function range from 0 to 1, and those of the hyperbolic tangent range from -1 to +1.

Piecewise Linear
The output is proportional to the total weighted input.

Gaussian
Gaussian functions are bell-shaped curves that are continuous. The node output (high/low) is interpreted in terms of class
membership (1/0), depending on how close the net input is to a chosen value of average.

Algorithm

There are different types of neural networks, but they are generally classified into feed-forward and feed-back networks.

A feed-forward network is a non-recurrent network which contains inputs, outputs, and hidden layers; the signals can
only travel in one direction. Input data is passed onto a layer of processing elements where it performs calculations. Each
processing element makes its computation based upon a weighted sum of its inputs. The new calculated values then
become the new input values that feed the next layer. This process continues until it has gone through all the layers and
determines the output. A threshold transfer function is sometimes used to quantify the output of a neuron in the output
layer. Feed-forward networks include Perceptron (linear and non-linear) and Radial Basis Function networks. Feed-forward
networks are often used in data mining.
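
A single feed-forward pass as described above can be sketched with NumPy; the weights and inputs below are arbitrary assumed values, and a logistic (sigmoid) transfer function is used at every node:

```python
# Illustrative sketch: one feed-forward pass through a 2-3-1 network.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.9])                  # input activation values

W_hidden = np.array([[0.8, -0.2],         # weights: 2 inputs -> 3 hidden nodes
                     [0.4,  0.9],
                     [-0.5, 0.3]])
b_hidden = np.array([0.1, -0.1, 0.05])

W_output = np.array([[1.2, -0.7, 0.5]])   # weights: 3 hidden nodes -> 1 output node
b_output = np.array([-0.3])

hidden = sigmoid(W_hidden @ x + b_hidden)      # each node sums its weighted inputs, then applies the transfer function
output = sigmoid(W_output @ hidden + b_output)
print(output)                                  # activation of the output node, in (0, 1)
```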

A feed-back network has feed-back paths meaning they can have signals traveling in both directions using loops. All
possible connections between neurons are allowed. Since loops are present in this type of network, it becomes a non-
linear dynamic system which changes continuously until it reaches a state of equilibrium. Feed-back networks are often
used in associative memories and optimization problems where the network looks for the best arrangement of
interconnected factors.

Map > Data Mining > Predicting the Future > Modeling > Classification > Support Vector Machine

Support Vector Machine - Classification (SVM)


A Support Vector Machine (SVM) performs classification by finding the hyperplane that maximizes the margin between the
two classes. The vectors (cases) that define the hyperplane are the support vectors.

Algorithm
1. Define an optimal hyperplane: maximize margin
2. Extend the above definition for non-linearly separable problems: have a penalty term for misclassifications.
3. Map data to high dimensional space where it is easier to classify with linear decision surfaces: reformulate problem
so that data is mapped implicitly to this space.

To define an optimal hyperplane we need to maximize the width of the margin (w).

We find w and b by solving the following objective function using Quadratic Programming.

The beauty of SVM is that if the data is linearly separable, there is a unique global minimum value. An ideal SVM analysis
should produce a hyperplane that completely separates the vectors (cases) into two non-overlapping classes. However,
perfect separation may not be possible, or it may result in a model with so many cases that the model does not classify
correctly. In this situation SVM finds the hyperplane that maximizes the margin and minimizes the misclassifications.

The algorithm tries to keep the slack variables at zero while maximizing the margin. However, it does not minimize the number of misclassifications (an NP-complete problem) but rather the sum of distances from the margin hyperplanes.

The simplest way to separate two groups of data is with a straight line (1 dimension), a flat plane (2 dimensions) or an N-dimensional hyperplane. However, there are situations where a nonlinear region can separate the groups more efficiently. SVM handles this by using a kernel function (nonlinear) to map the data into a different space where a hyperplane (linear) can be used to do the separation. This means a non-linear function is learned by a linear learning machine in a high-dimensional feature space, while the capacity of the system is controlled by a parameter that does not depend on the dimensionality of the space. This is called the kernel trick, which means the kernel function transforms the data into a higher-dimensional feature space to make it possible to perform the linear separation.

Map data into new space, then take the inner product of the new vectors. The image of the inner product of the data is the inner product of the images of the data. Two kernel functions are shown below.
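
The original page shows two kernels at this point; two commonly used choices, given here for reference, are the polynomial kernel and the Gaussian (RBF) kernel:

```latex
K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j + 1)^d
\qquad\qquad
K(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left( -\frac{\lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2}{2\sigma^2} \right)
```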

Exercise

Map > Data Mining > Predicting the Future > Modeling > Regression

Regression
Regression is a data mining task of predicting the value of target (numerical variable) by building a model based on one or
more predictors (numerical and categorical variables).

1. Frequency Table
Decision Tree
2. Covariance Matrix
Multiple Linear Regression
3. Similarity Function
K Nearest Neighbors
4. Others
Artificial Neural Network
Support Vector Machine

Map > Data Mining > Predicting the Future > Modeling > Regression > Decision Tree

Decision Tree - Regression


A decision tree builds regression or classification models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node (e.g., Outlook) has two or more branches (e.g., Sunny, Overcast and Rainy), each representing a value of the attribute tested. A leaf node (e.g., Hours Played) represents a decision on the numerical target. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.

Decision Tree Algorithm

The core algorithm for building decision trees, called ID3 and developed by J. R. Quinlan, employs a top-down, greedy search through the space of possible branches with no backtracking. The ID3 algorithm can be used to construct a decision tree for regression by replacing Information Gain with Standard Deviation Reduction.

Standard Deviation
A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances
with similar values (homogenous). We use standard deviation to calculate the homogeneity of a numerical sample. If the
numerical sample is completely homogeneous its standard deviation is zero.

a) Standard deviation for one attribute:

b) Standard deviation for two attributes:

Standard Deviation Reduction
The standard deviation reduction is based on the decrease in standard deviation after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest standard deviation reduction (i.e., the most homogeneous branches).
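
Written out, the quantities used in the steps below are the (population) standard deviation of the target, its weighted average over the branches of a split on attribute X, and their difference:

```latex
S(T) = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}
\qquad
S(T, X) = \sum_{v \in X} P(v)\, S(T_v)
\qquad
SDR(T, X) = S(T) - S(T, X)
```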

Step 1: The standard deviation of the target is calculated.

Standard deviation (Hours Played) = 9.32

Step 2: The dataset is then split on the different attributes. The standard deviation for each branch is calculated. The
resulting standard deviation is subtracted from the standard deviation before the split. The result is the standard
deviation reduction.

Step 3: The attribute with the largest standard deviation reduction is chosen for the decision node.

Step 4a: Dataset is divided based on the values of the selected attribute.

Step 4b: A branch set with standard deviation more than 0 needs further splitting.
In practice, we need some termination criteria. For example, when standard deviation for the branch becomes smaller
than a certain fraction (e.g., 5%) of standard deviation for the full dataset OR when too few instances remain in the
branch (e.g., 3).

Step 5: The process is run recursively on the non-leaf branches, until all data is processed.
When the number of instances is more than one at a leaf node we calculate the average as the final value for the target.

Exercise

Try to invent a new algorithm to construct a decision tree from data using MLR instead of average at the leaf node.

Map > Data Mining > Predicting the Future > Modeling > Regression > Multiple Linear Regression

Multiple Linear Regression


Multiple linear regression (MLR) is a method used to model the linear relationship between a dependent variable (target)
and one or more independent variables (predictors).

MLR is based on ordinary least squares (OLS): the model is fit such that the sum of squares of the differences between observed and predicted values is minimized.
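
In standard notation, the model and the OLS criterion are:

```latex
y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p + \varepsilon
\qquad\qquad
\min_{b_0, \dots, b_p} \; \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```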

The MLR model is based on several assumptions (e.g., errors are normally distributed with zero mean and constant
variance). Provided the assumptions are satisfied, the regression estimators are optimal in the sense that they are
unbiased, efficient, and consistent. Unbiased means that the expected value of the estimator is equal to the true value of
the parameter. Efficient means that the estimator has a smaller variance than any other estimator. Consistent means that
the bias and variance of the estimator approach zero as the sample size approaches infinity.

How good is the model?


R2, also called the coefficient of determination, summarizes the explanatory power of the regression model and is computed from the sums-of-squares terms.

R2 describes the proportion of variance of the dependent variable explained by the regression model. If the regression model is perfect, SSE is zero and R2 is 1. If the regression model is a total failure, SSE is equal to SST, no variance is explained by the regression, and R2 is zero. It is important to keep in mind that there is no direct relationship between a high R2 and causation.
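
With SST, SSR and SSE denoting the total, regression and error sums of squares, the standard definition is:

```latex
SST = \sum_i (y_i - \bar{y})^2, \quad
SSR = \sum_i (\hat{y}_i - \bar{y})^2, \quad
SSE = \sum_i (y_i - \hat{y}_i)^2, \qquad
R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}
```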

How significant is the model?
F-ratio estimates the statistical significance of the regression model and is computed from the mean squared terms in the
ANOVA table. The significance of the F-ratio is obtained by referring to the F distribution table using two degrees of
freedom (dfMSR, dfMSE). p is the number of independent variables (e.g., p is one for the simple linear regression).

The advantage of the F-ratio over R2 is that the F-ratio incorporates the sample size and the number of predictors in the assessment of the significance of the regression model. A model can have a high R2 and still not be statistically significant.

How significant are the coefficients?


If the regression model is significantly good, we can use t-test to estimate the statistical significance of each coefficient.

Example

Multicollinearity
A high degree of multicollinearity between predictors produces unreliable regression coefficient estimates. Signs of multicollinearity include:
1. High correlation between pairs of predictor variables.
2. Regression coefficients whose signs or magnitudes do not make good physical sense.
3. Statistically nonsignificant regression coefficients on important predictors.
4. Extreme sensitivity of sign or magnitude of regression coefficients to insertion or deletion of a predictor.

The diagonal values of the (X'X)-1 matrix are called Variance Inflation Factors (VIFs), and they are very useful measures of multicollinearity. If any VIF exceeds 5, multicollinearity is a problem.

Model Selection
A frequent problem in data mining is to avoid predictors that do not contribute significantly to model prediction. First, it has been shown that dropping predictors with insignificant coefficients can reduce the average error of predictions. Second, estimates of the regression coefficients are likely to be unstable due to multicollinearity in models with many variables. Finally, a simpler model is a better model, offering more insight into the influence of the predictors. There
are two main methods of model selection:
Forward selection, the best predictors are entered in the model, one by one.
Backward Elimination, the worst predictors are eliminated from the model, one by one.

Exercise

Linear Regression Interactive

Map > Data Mining > Predicting the Future > Modeling > Regression > K Nearest Neighbors

K Nearest Neighbors - Regression


K nearest neighbors is a simple algorithm that stores all available cases and predicts the numerical target based on a similarity measure (e.g., distance functions). KNN has been used in statistical estimation and pattern recognition as a non-parametric technique since the beginning of the 1970s.

Algorithm
A simple implementation of KNN regression is to calculate the average of the numerical target of the K nearest
neighbors. Another approach uses an inverse distance weighted average of the K nearest neighbors. KNN regression uses
the same distance functions as KNN classification.

The above three distance measures are only valid for continuous variables. In the case of categorical variables you must
use the Hamming distance, which is a measure of the number of instances in which corresponding symbols are different
in two strings of equal length.

Choosing the optimal value for K is best done by first inspecting the data. In general, a large K value is more precise as it
reduces the overall noise; however, the compromise is that the distinct boundaries within the feature space are blurred.
Cross-validation is another way to retrospectively determine a good K value by using an independent data set to validate
your K value. The optimal K for most datasets is 10 or more. That produces much better results than 1-NN.

Example:
Consider the following data concerning House Price Index or HPI. Age and Loan are two numerical variables (predictors)
and HPI is the numerical target.

We can now use the training set to predict the target of an unknown case (Age=48 and Loan=$142,000) using Euclidean distance. If K=1 then the nearest neighbor is the last case in the training set, with HPI=264.

D = Sqrt[(48-33)^2 + (142000-150000)^2] = 8000.01 >> HPI = 264

By having K=3, the prediction for HPI is equal to the average of HPI for the top three neighbors.

HPI = (264+139+139)/3 = 180.7
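
Both averaging rules mentioned above (plain average and inverse-distance weighted average) can be sketched as follows; the training values below are hypothetical:

```python
# Illustrative KNN regression sketch: average vs. inverse-distance weighted average.
import numpy as np

def knn_regression(X_train, y_train, query, k=3, weighted=False):
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    if not weighted:
        return y_train[nearest].mean()
    weights = 1.0 / (distances[nearest] + 1e-9)    # avoid division by zero for exact matches
    return np.average(y_train[nearest], weights=weights)

X_train = np.array([[25, 40000], [35, 60000], [45, 80000], [33, 150000]])  # [Age, Loan] (hypothetical)
y_train = np.array([135.0, 256.0, 231.0, 264.0])                           # HPI (hypothetical)
query   = np.array([48, 142000])

print(knn_regression(X_train, y_train, query, k=3))
print(knn_regression(X_train, y_train, query, k=3, weighted=True))
```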

Standardized Distance
One major drawback in calculating distance measures directly from the training set is in the case where variables have
different measurement scales or there is a mixture of numerical and categorical variables. For example, if one variable is
based on annual income in dollars, and the other is based on age in years then income will have a much higher influence
on the distance calculated. One solution is to standardize the training set as shown below.

As mentioned in KNN Classification, using the standardized distance on the same training set, the unknown case returned a different neighbor, which is not a good sign of robustness.

Exercise

Map > Data Mining > Predicting the Future > Modeling > Regression > Support Vector Machine

Support Vector Machine - Regression (SVR)


Support Vector Machine can also be used as a regression method, maintaining all the main features that characterize the
algorithm (maximal margin). Support Vector Regression (SVR) uses the same principles as the SVM for classification,
with only a few minor differences. Because the output is a real number rather than a class label, there are infinitely many
possible values, so an exact fit cannot be demanded of every training point. Instead, a margin of tolerance (epsilon) is set
around the regression function, and errors that fall within this margin are ignored. The main idea is always the same:
minimize the error by finding the hyperplane that maximizes the margin, keeping in mind that part of the error is tolerated.

Linear SVR

Non-linear SVR
The kernel functions transform the data into a higher-dimensional feature space in which a linear regression can be
performed.
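
As a hedged sketch (not part of the original text), the following example fits a linear SVR and a kernel (RBF) SVR with scikit-learn; the toy data and the C, epsilon and gamma values are arbitrary illustrative choices.

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(40, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=40)  # noisy non-linear target

linear_svr = SVR(kernel="linear", C=1.0, epsilon=0.1)       # linear SVR
rbf_svr = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=1.0)  # non-linear SVR via the RBF kernel

linear_svr.fit(X, y)
rbf_svr.fit(X, y)
print("linear SVR prediction at x=2.5:", linear_svr.predict([[2.5]]))
print("RBF SVR prediction at x=2.5:", rbf_svr.predict([[2.5]]))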

Kernel functions

Exercise

Map > Data Mining > Predicting the Future > Modeling > Clustering

Clustering
A cluster is a subset of data objects which are similar to one another. Clustering (also called unsupervised learning) is the
process of dividing a dataset into groups such that the members of each group are as similar (close) to one another as
possible, and different groups are as dissimilar (far) from one another as possible. Clustering can uncover previously
undetected relationships in a dataset. There are many applications for cluster analysis. For example, in business, cluster
analysis can be used to discover and characterize customer segments for marketing purposes, and in biology it can be
used for the classification of plants and animals given their features.

Two main groups of clustering algorithms are:


1. Hierarchical
Agglomerative
Divisive
2. Partitive
K Means
Self-Organizing Map

The requirements of a good clustering method are:


The ability to discover some or all of the hidden clusters.
Within-cluster similarity and between-cluster dissimilarity.
Ability to deal with various types of attributes.
Can deal with noise and outliers.
Can handle high dimensionality.
Scalable, interpretable and usable.

An important issue in clustering is how to determine the similarity between two objects, so that clusters can be formed
from objects with high similarity within clusters and low similarity between clusters. Commonly, to measure the similarity or
dissimilarity between objects, a distance measure such as the Euclidean, Manhattan or Minkowski distance is used. A
distance function returns a lower value for pairs of objects that are more similar to one another.
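
For illustration, here is a minimal Python sketch of the three distance measures mentioned above (the example vectors are made up).

import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    return np.sum(np.abs(a - b))

def minkowski(a, b, p=3):
    # Generalizes both: p=1 gives Manhattan, p=2 gives Euclidean.
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a, b = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(euclidean(a, b), manhattan(a, b), minkowski(a, b, p=3))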

Map > Data Mining > Predicting the Future > Modeling > Clustering > Hierarchical

Hierarchical Clustering
Hierarchical clustering involves creating clusters that have a predetermined ordering from top to bottom. For example, all
files and folders on the hard disk are organized in a hierarchy. There are two types of hierarchical clustering, Divisive and
Agglomerative.

Divisive method
In this method we assign all of the observations to a single cluster and then partition the cluster to two least similar
clusters. Finally, we proceed recursively on each cluster until there is one cluster for each observation.

Agglomerative method
In this method we assign each observation to its own cluster. Then, compute the similarity (e.g., distance) between each
of the clusters and join the two most similar clusters. Finally, repeat steps 2 and 3 until there is only a single cluster left.
The related algorithm is shown below.

Before any clustering is performed, it is required to determine the proximity matrix containing the distance between each
pair of points, using a distance function. Then, the matrix is updated to reflect the distance between each pair of clusters.
The following three methods differ in how the distance between clusters is measured.

Single Linkage
In single linkage hierarchical clustering, the distance between two clusters is defined as the shortest distance between a
point in one cluster and a point in the other. For example, the distance between clusters r and s is equal to the length of
the arrow between their two closest points.

Complete Linkage
In complete linkage hierarchical clustering, the distance between two clusters is defined as the longest distance between a
point in one cluster and a point in the other. For example, the distance between clusters r and s is equal to the length of
the arrow between their two furthest points.

Average Linkage
In average linkage hierarchical clustering, the distance between two clusters is defined as the average distance between
each point in one cluster and every point in the other cluster. For example, the distance between clusters r and s is equal
to the average length of the arrows connecting the points of one cluster to the points of the other.
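
The following sketch compares the three linkage criteria using SciPy's agglomerative clustering utilities; the six two-dimensional points are made-up illustrative data.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1.0, 1.0], [1.5, 1.2], [1.2, 0.8],
                   [5.0, 5.0], [5.5, 5.3], [4.8, 5.4]])

for method in ("single", "complete", "average"):
    Z = linkage(points, method=method)               # proximity-based merge tree
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(method, labels)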

Exercise

Hierarchical Clustering Interactive

Map > Data Mining > Predicting the Future > Modeling > Clustering > K-Means

K-Means Clustering
K-Means clustering intends to partition n objects into k clusters in which each object belongs to the cluster with the
nearest mean. This method produces exactly k different clusters of the greatest possible distinction. The best number of
clusters k, the one leading to the greatest separation (distance), is not known a priori and must be computed from the data.
The objective of K-Means clustering is to minimize the total intra-cluster variance, or the squared error function:
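
Written out (using, for illustration, $x_i^{(j)}$ to denote the $i$-th object assigned to cluster $j$ and $c_j$ to denote the centroid of cluster $j$), the squared error function takes the standard form

$$J = \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left\lVert x_i^{(j)} - c_j \right\rVert^2,$$

where $n_j$ is the number of objects in cluster $j$.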

Algorithm
1. Cluster the data into k groups, where k is predefined.
2. Select k points at random as cluster centers.
3. Assign objects to their closest cluster center according to the Euclidean distance function.
4. Calculate the centroid or mean of all objects in each cluster.
5. Repeat steps 3 and 4 until the same points are assigned to each cluster in consecutive rounds.

K-Means is a relatively efficient method. However, we need to specify the number of clusters in advance, the final results
are sensitive to initialization, and the algorithm often terminates at a local optimum. Unfortunately, there is no global
theoretical method for finding the optimal number of clusters. A practical approach is to compare the outcomes of multiple
runs with different k and choose the best one based on a predefined criterion. In general, a large k probably decreases the
error but increases the risk of overfitting.

Example:
Suppose we want to group the visitors to a website using just their age (a one-dimensional space) as follows:
15,15,16,19,19,20,20,21,22,28,35,40,41,42,43,44,60,61,65
Initial clusters:
Centroid (C1) = 16 [16]
Centroid (C2) = 22 [22]
Iteration 1:
C1 = 15.33 [15,15,16]
C2 = 36.25 [19,19,20,20,21,22,28,35,40,41,42,43,44,60,61,65]
Iteration 2:
C1 = 18.56 [15,15,16,19,19,20,20,21,22]
C2 = 45.90 [28,35,40,41,42,43,44,60,61,65]
Iteration 3:

C1 = 19.50 [15,15,16,19,19,20,20,21,22,28]
C2 = 47.89 [35,40,41,42,43,44,60,61,65]
Iteration 4:
C1 = 19.50 [15,15,16,19,19,20,20,21,22,28]
C2 = 47.89 [35,40,41,42,43,44,60,61,65]

No change between iterations 3 and 4 has been noted, so the algorithm has converged. By using clustering, two groups have
been identified: ages 15-28 and ages 35-65. The initial choice of centroids can affect the output clusters, so the algorithm
is often run multiple times with different starting conditions in order to get a fair view of what the clusters should be.
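
The following sketch reproduces the age example above with a plain NumPy implementation of the K-Means loop (k = 2, initial centroids 16 and 22). Note that ties (e.g., age 19 being equidistant from both initial centroids) may be broken differently than in the hand-worked iterations, but the final centroids are the same.

import numpy as np

ages = np.array([15, 15, 16, 19, 19, 20, 20, 21, 22, 28,
                 35, 40, 41, 42, 43, 44, 60, 61, 65], dtype=float)
centroids = np.array([16.0, 22.0])  # initial cluster centers

while True:
    # Assign each age to the closest centroid.
    labels = np.argmin(np.abs(ages[:, None] - centroids[None, :]), axis=1)
    # Recompute each centroid as the mean of its assigned ages.
    new_centroids = np.array([ages[labels == k].mean() for k in range(2)])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(centroids)  # approximately [19.50, 47.89], matching iterations 3 and 4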

Exercise

K Means Interactive

Map > Data Mining > Predicting the Future > Modeling > Clustering > Self Organizing Map

Self Organizing Map


The Self-Organizing Map (SOM) is used for the visualization and analysis of high-dimensional datasets. SOM facilitates the
presentation of high-dimensional datasets in lower-dimensional ones, usually 1-D, 2-D or 3-D. It is an unsupervised
learning algorithm and does not require a target vector, since it learns to classify data without supervision. A SOM is
formed from a grid of nodes (units) to which the input data are presented. Every node is connected to the input, and
there is no connection between the nodes. SOM is a topology-preserving technique and keeps the neighborhood relations
in its mapping presentation.

Algorithm
1- Initialize each node's weights with random numbers between 0 and 1.

2- Choose a random input vector from the training dataset.

3- Calculate the Best Matching Unit (BMU). Each node is examined to find the one whose weights are most similar to the
input vector; this unit is known as the Best Matching Unit (BMU). The selection is done with the Euclidean distance formula,
which is a measure of similarity between two vectors: the distance between the input vector and the weights of each node
is calculated in order to find the BMU.

4- Calculate the size of the neighborhood around the BMU. The size of the neighborhood decreases with an exponential
decay function; it shrinks on each iteration until it reaches just the BMU.

5- Modify the weights of the BMU and of the neighboring nodes, so that their weights become more similar to the input
vector. The weight of every node within the neighborhood is adjusted, with a greater change for neighbors closer to the
BMU.

The learning rate also decays, and is recalculated at each iteration.

As training goes on, the neighborhood gradually shrinks. At the end of training, the neighborhood has shrunk to zero
size.

The influence rate expresses the amount of influence a node's distance from the BMU has on its learning. In the simplest
form, the influence rate is equal to 1 for all the nodes close to the BMU and zero for the others, but a Gaussian function is
also common. Finally, starting from a random distribution of weights and through many iterations, the SOM arrives at a
map of stable zones. In the end, the interpretation of the map must be done by a human, but SOM is a great technique for
revealing hidden patterns in the data.
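
The following sketch ties the steps above together in a minimal NumPy implementation; the grid size, iteration count, decay constants and training data are illustrative assumptions, not prescribed values.

import numpy as np

rng = np.random.default_rng(0)
grid_w, grid_h, dim = 10, 10, 3              # 10x10 map, 3-dimensional inputs
weights = rng.random((grid_w, grid_h, dim))  # step 1: random weights in [0, 1]
data = rng.random((200, dim))                # made-up training vectors

n_iter = 1000
sigma0 = max(grid_w, grid_h) / 2.0           # initial neighborhood radius
lr0 = 0.1                                    # initial learning rate
time_const = n_iter / np.log(sigma0)

# Grid coordinates of every node, used to measure distance to the BMU on the map.
coords = np.stack(np.meshgrid(np.arange(grid_w), np.arange(grid_h), indexing="ij"), axis=-1)

for t in range(n_iter):
    x = data[rng.integers(len(data))]                      # step 2: random input vector
    dists = np.linalg.norm(weights - x, axis=-1)           # step 3: find the BMU
    bmu = np.unravel_index(np.argmin(dists), dists.shape)
    sigma = sigma0 * np.exp(-t / time_const)               # step 4: shrinking neighborhood
    lr = lr0 * np.exp(-t / n_iter)                         # decaying learning rate
    grid_dist2 = np.sum((coords - np.array(bmu)) ** 2, axis=-1)
    influence = np.exp(-grid_dist2 / (2.0 * sigma ** 2))   # Gaussian influence rate
    # Step 5: pull each node's weights toward the input, more strongly near the BMU.
    weights += lr * influence[..., None] * (x - weights)

print("trained weight map shape:", weights.shape)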

Exercise

SOM Interactive (Windows only)

Map > Data Mining > Predicting the Future > Modeling > Association Rules

Association Rules
Association rule mining finds all sets of items (itemsets) that have support greater than the minimum support and then uses
those large itemsets to generate the desired rules that have confidence greater than the minimum confidence. The lift of a
rule X -> Y is the ratio of the observed support to that expected if X and Y were independent. A typical and widely used
application of association rules is market basket analysis.

Example:
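As a minimal sketch, the following Python example computes support, confidence and lift for a candidate rule X -> Y on a handful of made-up market basket transactions.

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"bread"}, {"milk"}
supp_rule = support(X | Y)                    # support(X -> Y) = P(X and Y)
confidence = supp_rule / support(X)           # confidence = P(Y | X)
lift = supp_rule / (support(X) * support(Y))  # observed support / support expected if independent

print(f"support={supp_rule:.2f} confidence={confidence:.2f} lift={lift:.2f}")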

AIS Algorithm
1. Candidate itemsets are generated and counted on-the-fly as the database is scanned.
2. For each transaction, it is determined which of the large itemsets of the previous pass are contained in this
transaction.
3. New candidate itemsets are generated by extending these large itemsets with other items in this transaction.

The disadvantage of the AIS algorithm is that it results in unnecessarily generating and counting too many candidate
itemsets that turn out to be small (infrequent).

SETM Algorithm
1. Candidate itemsets are generated on-the-fly as the database is scanned, but counted at the end of the pass.
2. New candidate itemsets are generated the same way as in AIS algorithm, but the TID of the generating transaction
is saved with the candidate itemset in a sequential structure.
3. At the end of the pass, the support count of candidate itemsets is determined by aggregating this sequential
structure.

The SETM algorithm has the same disadvantage as the AIS algorithm. Another disadvantage is that for each candidate
itemset there are as many entries as its support value.

Apriori Algorithm
1. Candidate itemsets are generated using only the large itemsets of the previous pass, without considering the
transactions in the database.
2. The large itemsets of the previous pass are joined with themselves to generate all itemsets whose size is larger by one.
3. Each generated itemset that has a subset which is not large is deleted. The remaining itemsets are the candidate
ones.

The Apriori algorithm takes advantage of the fact that any subset of a frequent itemset is also frequent. The algorithm can
therefore reduce the number of candidates being considered by only exploring itemsets whose support count is greater
than the minimum support count. Any itemset that has an infrequent subset can be pruned.
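
A hedged sketch of the join-and-prune step described above (the frequent 2-itemsets used in the example are made up):

from itertools import combinations

def apriori_gen(large_prev):
    # Generate candidate itemsets of size k from the large itemsets of size k-1.
    large_prev = set(large_prev)
    k = len(next(iter(large_prev))) + 1
    # Join step: union pairs of (k-1)-itemsets that differ by exactly one item.
    candidates = {a | b for a in large_prev for b in large_prev if len(a | b) == k}
    # Prune step: keep a candidate only if all of its (k-1)-subsets are large.
    return {c for c in candidates
            if all(frozenset(s) in large_prev for s in combinations(c, k - 1))}

# Example with made-up frequent 2-itemsets:
L2 = {frozenset(p) for p in [("bread", "milk"), ("bread", "beer"),
                             ("milk", "beer"), ("milk", "diapers")]}
print(apriori_gen(L2))
# Only {'bread', 'milk', 'beer'} survives; for example, {'milk', 'beer', 'diapers'}
# is pruned because {'beer', 'diapers'} is not a large 2-itemset.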

AprioriTid Algorithm
1. The database is not used at all for counting the support of candidate itemsets after the first pass.
2. The candidate itemsets are generated the same way as in the Apriori algorithm.
3. Another set C is generated, in which each member holds the TID of a transaction together with the large itemsets
present in that transaction. This set is used to count the support of each candidate itemset.

The advantage is that the number of entries in C may be smaller than the number of transactions in the database,
especially in the later passes.

AprioriHybrid Algorithm
Apriori does better than AprioriTid in the earlier passes. However, AprioriTid does better than Apriori in the later passes.
Hence, a hybrid algorithm can be designed that uses Apriori in the initial passes and switches to AprioriTid when it
expects that the set C will fit in memory.

Exercise

