
MISY 631

Final Review
Calculators will be provided for the exam.
Topic 1: Naïve Bayes Classification (classification models)
1. Be able to understand and apply Bayes’ rule
Naïve Bayes is another way of classifying data, based on Bayes' rule.
Formula: P(A|B) = P(B|A)P(A) / P(B)
 P(AB) = P(A)P(B|A)
 P(AB) = P(B)P(A|B)
 Therefore P(A)P(B|A) = P(B)P(A|B), which rearranges to Bayes' rule.
If A and B are independent (the outcome of A does not affect P(B)), then P(AB) = P(A)P(B); if they are dependent, P(AB) ≠ P(A)P(B).
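For item 1, a quick numeric check of Bayes' rule; the probabilities below are made up for illustration.

```python
# Bayes' rule with made-up numbers: P(A|B) = P(B|A) * P(A) / P(B)
p_a, p_b, p_b_given_a = 0.3, 0.5, 0.8

p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)   # ≈ 0.48
```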
2. Be able to calculate the probability for an instance (e.g., the probability given in slide
10, lecture 8) using Naïve Bayes Classification
P(Default = yes | Balance >= 50K, Age < 45, Employed = No)
= P(Balance >= 50K, Age < 45, Employed = No | Default = yes) P(Default = yes) / P(Balance >= 50K, Age < 45, Employed = No)

General form:
P(C = 1 | d1 = v1, d2 = v2, …, dn = vn)
= P(d1 = v1, d2 = v2, …, dn = vn | C = 1) P(C = 1) / [P(d1 = v1, …, dn = vn | C = 1) P(C = 1) + P(d1 = v1, …, dn = vn | C = 0) P(C = 0)]
C: class e.g., default (1), not default (0)
d1, d2, …, dn: descriptive attributes
Why it is difficult to calculate
 Assuming every descriptive attribute is binary, the number of possible
combinations of (d1 = v1, d2 = v2, … , dn = vn) is 2^n
There are 2^n + 1 estimations
 Some combinations may not appear in the training data, and hence there is no way
to estimate their probabilities
Solution
 Assume descriptive attributes are independent of each other given C:
P(d1 = v1, d2 = v2, … , dn = vn|C = 1)
= P(d1 = v1|C = 1)P(d2 = v2|C = 1) … P(dn = vn|C = 1)
 This might sacrifice prediction accuracy to some extent but significantly
reduces the level of difficulty (see the sketch below).
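As an illustration of the formulas above, a minimal sketch of the Naïve Bayes calculation for the Default example. All probabilities below are hypothetical placeholders, not the numbers from lecture 8; in practice they are estimated from training data.

```python
# Naive Bayes with the conditional independence assumption (hypothetical numbers).

# Class priors
p_yes, p_no = 0.3, 0.7

# P(attribute value | class), one per descriptive attribute
p_given_yes = {"Balance>=50K": 0.6, "Age<45": 0.5, "Employed=No": 0.4}
p_given_no  = {"Balance>=50K": 0.2, "Age<45": 0.7, "Employed=No": 0.1}

def product(values):
    result = 1.0
    for v in values:
        result *= v
    return result

# Independence assumption: joint conditional = product of single conditionals
num_yes = product(p_given_yes.values()) * p_yes
num_no  = product(p_given_no.values()) * p_no

# Denominator expands over both classes, as in the formula above
p_default_yes = num_yes / (num_yes + num_no)
print(round(p_default_yes, 3))   # ≈ 0.786 with these made-up numbers
```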

3. Be able to use Laplace correction to estimate probabilities for Naïve Bayes classification
Laplace correction
P(C) = (# of training instances belonging to class C + 1) / (# of training instances + K)
K: number of predefined classes

Estimating P(d1 = v1|C = 1) with Laplace correction:
P(d1 = v1|C = 1) = ([# of training instances with C = 1 and d1 = v1] + 1) / ([# of training instances with C = 1] + J)
J: # of values d1 can take
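A minimal sketch of both Laplace-corrected estimates, using hypothetical counts:

```python
def laplace_prior(n_in_class, n_total, K):
    """P(C) = (# of training instances in class C + 1) / (# of training instances + K)."""
    return (n_in_class + 1) / (n_total + K)

def laplace_conditional(n_class_and_value, n_in_class, J):
    """P(d1=v1|C) = (# instances with C and d1=v1 + 1) / (# instances with C + J)."""
    return (n_class_and_value + 1) / (n_in_class + J)

# Hypothetical counts: 100 training instances, 30 in class 1, K = 2 classes
print(laplace_prior(30, 100, 2))        # (30+1)/(100+2) ≈ 0.304
# None of the 30 class-1 instances have d1 = v1; d1 can take J = 3 values
print(laplace_conditional(0, 30, 3))    # (0+1)/(30+3) ≈ 0.030 instead of 0
```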


4. Be able to understand the concept of AUC (i.e., what is AUC?) and the meaning of
special points on ROC and special ROC curve (see slide 28, lecture 8)
AUC: Area under the ROC Curve
A different cutoff gives a different result even though the predicted probabilities stay the
same. So you need a more robust and more comprehensive measure for
your classification method, and this measure is called ROC (receiver operating
characteristic). We cannot just use 0.5; we want to evaluate a classifier
using a group of cutoff probabilities. That is why we need ROC. If you change the cutoff
point you get another pair of TPR and FPR values; plotting these pairs gives the ROC curve.
A graphical approach for displaying trade-off between detection rate (True Positive Rate)
and false alarm rate (False Positive Rate).
ROC curve plots TPR against FPR
 Performance of a model represented as a point in an ROC curve
 Changing the threshold (i.e., cutoff) parameter of classifier changes the location
of the point
TPR (True Positive Rate) or Recall: the fraction of actual positive instances that are
predicted as positive. TPR = TP/(TP+FN)

FPR (False Positive Rate): the fraction of actual negative instances that are
predicted as positive. FPR = FP/(FP+TN)
(TPR, FPR):
 (0,0): predict every instance to be negative
no false alarms are raised, so the false positive rate is 0 (but the true positive rate is also 0).
 (1,1): predict every instance to be positive
every actual negative triggers a false alarm, so FPR = 1 (and TPR = 1).
 (1,0): perfect prediction; every instance is classified correctly.
 Diagonal line: random guessing
To draw ROC curve, classifier must produce continuous-valued output
 Outputs are used to rank test records, from the most likely positive class
record to the least likely positive class record
Many classifiers produce only discrete outputs (i.e., predicted class)
 How to get continuous-valued outputs?
 Decision trees, rule-based classifiers, neural networks, Bayesian
classifiers, k-nearest neighbors, SVM

Calculate the area below the M1 and M2 curves; M1 is better for small FPR, and a larger area is better overall.
If the area is close to 0.5, the classifier is useless. If it is less than 0.5, you have to toss it because that is
worse than random guessing.
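A minimal sketch of how ROC points and AUC can be computed by sweeping the cutoff over predicted probabilities; the labels and scores below are made up.

```python
# Sweep the cutoff to get (FPR, TPR) points, then compute AUC by the trapezoidal rule.
y_true  = [1, 1, 0, 1, 0, 0, 1, 0]                      # actual classes
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]     # predicted P(positive)

def tpr_fpr(cutoff):
    tp = sum(1 for t, s in zip(y_true, y_score) if s >= cutoff and t == 1)
    fp = sum(1 for t, s in zip(y_true, y_score) if s >= cutoff and t == 0)
    fn = sum(1 for t, s in zip(y_true, y_score) if s < cutoff and t == 1)
    tn = sum(1 for t, s in zip(y_true, y_score) if s < cutoff and t == 0)
    return tp / (tp + fn), fp / (fp + tn)

# One ROC point per cutoff; the extreme cutoffs give the points (0,0) and (1,1)
cutoffs = [1.1] + sorted(set(y_score), reverse=True) + [0.0]
roc = sorted({(fpr, tpr) for tpr, fpr in (tpr_fpr(c) for c in cutoffs)})

# AUC = area under the curve, via the trapezoidal rule
auc = sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(roc, roc[1:]))
print(roc)
print(auc)   # 0.75 for this toy data
```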

Topic 2: Logistic Regression


Classification is the process of learning a model (from training data) that can assign an
instance of interest (whose class is unknown) to one of the predefined classes.
The classification process consists of two steps:
 Step 1: Learn a classification model from training data;
 Step 2: Apply the model to classify an instance with unknown class.

Instance space: plotting training instances according to their values of descriptive attributes.
A decision tree model is essentially a set of horizontal/vertical decision boundaries that
partition an instance space into groups of homogeneous instances.

1. What is a linear model? The (graphical) difference between linear model and decision
tree.
For a linear model you have one decision boundary, but for a decision tree you have
multiple.
The decision boundary (a line with a slope) is used to classify data (default or no default, "+"s or dots).
A linear model employs a linear combination of descriptive attributes, namely the decision
boundary or f(x), to classify instances. (The objective of a linear model is to learn all these w's
from training data.)
 f(x) = w0 +w1 x1 +w2 x2 +…+ wn xn
 Values of w0 , w1 , w2 , … wn are learned from training data.
 A new instance (not in the training data) is classified as one class if f(x) > 0 or the
other class if f(x) ≤ 0
 Absolute values of w1 , w2 , … wn generally indicate the importance of their
respectively associated descriptive attribute in classifying an instance.
 Generally, the larger the absolute value of wi, the more important its associated
attribute in classification.

2. What is logistic regression? Formulas for logistic regression (slide 15, lecture 9)?
Why it is a linear model?
Logistic regression learns from training data to predict the probability P(Y=C|x).
C: class e.g., default or no default
x= (x1,x2,…xn) is a set of descriptive attributes, e.g., age, balance, employed
Example: P (Y=Default|Age<45, Balance>= 50K, Employed=No)
 Logistic regression assumes:
 P(Y = 1|x) = 1 / (1 + exp(w0 + w1 x1 + w2 x2 + … + wn xn))
exp() is the exponential function.
 P(Y = 0|x) = 1 − P(Y = 1|x)
= exp(w0 + w1 x1 + w2 x2 + … + wn xn) / (1 + exp(w0 + w1 x1 + w2 x2 + … + wn xn))
As w0 + w1 x1 + … + wn xn becomes large, P(Y = 0|x) gets closer to 1; as it becomes very negative, P(Y = 0|x) gets closer to 0.
Logistic regression:
 Learns values of w0 , w1 , w2 , … wn from training data.
 Then, predicts 𝑃(𝑌 = 1|𝒙), 𝑃(𝑌 = 0|𝒙) for a new instance using the formulas in
the previous slide;
 Classifies a new instance as class 1 if 𝑃(𝑌 = 1|𝒙) > 𝑃(𝑌 = 0|𝒙) or class 0
otherwise.
WHY LINEAR?
If P(Y=0|x)/P(Y=1|x) > 1, then classify x as 0.
Since P(Y=0|x)/P(Y=1|x) = exp(w0 + w1x1 + … + wnxn), if exp(…) > 1, classify x as 0.
Taking logs: ln(exp(…)) > ln(1) = 0, so classify the instance as 0 when
w0 + w1x1 + … + wnxn > 0, which is exactly the form of a linear model (see the sketch below).
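A minimal prediction-step sketch using the slide's formula P(Y=1|x) = 1/(1 + exp(w0 + w1x1 + … + wnxn)); the weights and the new instance are made up, since in practice the w's are learned from training data.

```python
import math

w = [0.5, -0.04, 1.2]            # hypothetical w0, w1, w2

def f(x):
    """Linear combination w0 + w1*x1 + ... + wn*xn."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def p_y1(x):
    return 1.0 / (1.0 + math.exp(f(x)))    # the slide's convention for P(Y=1|x)

def classify(x):
    p1 = p_y1(x)
    return 1 if p1 > 1 - p1 else 0          # class 1 iff P(Y=1|x) > P(Y=0|x)

x_new = [40, 0.3]                            # hypothetical attribute values
print(round(p_y1(x_new), 3), classify(x_new))   # roughly 0.68 -> class 1 here
```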
3. Comparison among the three classification methods
Functionality: All three methods (decision tree, Naïve Bayes, logistic regression) can
classify a new instance and predict the probability that a new instance belonging to a
class.
Performance: We do not know which one is better in every situation. Use cross-validation
to evaluate the performance of each method and compare average AUC. For a given
real-world case, apply 10-fold cross-validation to choose the best one.
Methodology:
 Decision tree is a piecewise classifier (i.e., it picks one attribute at a time) and creates
many decision boundaries to partition an instance space.
 Logistic regression considers all attributes at the same time (i.e., a linear
combination of attributes) and creates one decision boundary to partition an instance
space.
 Naïve Bayes is not a partitioning approach; it employs Bayes' rule and the conditional
independence assumption to estimate the probability that a new instance belongs to
a class.
Comprehensibility:
 Decision tree visualizes the produced model, and it is easy to understand for
managers without a strong background in statistics and data mining. However, the tree
may grow too large in size.
 Logistic regression is easy to understand for managers with a background in
statistics.
 For managers, Naïve Bayes only produces a probability estimate and functions
like a black box.
Explanatory power:
 Decision tree has good explanatory power. Generally, attributes appearing in
higher levels of a tree are more important for classification than those in lower
levels.

 Logistic regression has strong explanatory power. Absolute values of w1 , w2 ,
… wn generally indicate the importance of their respectively associated
descriptive attribute in classifying an instance.
 For managers, the explanatory power of Naïve Bayes is low.

4. Cost-sensitive learning
Making classification decisions based on probabilities ONLY could be problematic:
 Probability (only)-based classification: classify an instance as 1 if
P(Y=1|X)>P(Y=0|X) or P(Y=1|X)>0.5
Cost-sensitive learning:
 We need to consider both probabilities and the cost/utility of a decision to
minimize cost or maximize utility
 Learn probabilities from training data
 Construct a cost/utility matrix from the business context (see the sketch below)
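A minimal sketch of the idea: choose the action with the lowest expected cost rather than the most probable class. The probabilities and cost matrix below are hypothetical.

```python
# Cost-sensitive decision: pick the action minimizing expected cost.
p_default = 0.2                      # P(Y=1|x) learned from training data
p_no_default = 1 - p_default

# cost[action][actual outcome]  (hypothetical business numbers)
cost = {
    "approve": {"default": 1000, "no_default": 0},    # large loss if the borrower defaults
    "reject":  {"default": 0,    "no_default": 50},   # lost profit if they would have repaid
}

def expected_cost(action):
    return (cost[action]["default"] * p_default
            + cost[action]["no_default"] * p_no_default)

best = min(cost, key=expected_cost)
# Probability-only classification would say "no default" (0.8) and approve,
# but the expected cost of approving (200) exceeds that of rejecting (40).
print({a: expected_cost(a) for a in cost}, "->", best)
```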

Topic 3: Support Vector Machine


1. What is the margin of a decision boundary in SVM?
Decision boundary is a hyperplane that separates one class from the other.
 if there are only two descriptive attributes: a hyperplane becomes a line
 There are an infinite number of hyperplanes; each of them perfectly classifies the
training examples, i.e., training error = 0
 The best hyperplane should perform the best in classifying test (unseen)
examples, i.e., minimizing test error.
 How to choose the best hyperplane: using K-fold cross-validation? Not possible,
because there are an infinite number of hyperplanes to choose from
The distance between H1 and H2 is the margin of decision boundary H
In the slide's figure, the blue decision boundary (hyperplane) incurs smaller test error
 intuitively, the blue hyperplane has a larger margin, thus more room for
unseen test examples
 theoretically, a smaller-margin hyperplane tends to overfit the training data, and
thus the red hyperplane incurs larger test error

2. What is SVM? Slide 16 Lecture 10


Support vector machine (SVM) finds the best decision boundary by searching for the
hyperplane with the largest margin.
SVM becomes an optimization problem:
Maximizing margin
Subject to (constraints)
all training examples above H1 belong to the “+” class
all training examples below H2 belong to the dot class

3. How does SVM handle non-separable training data and non-linearly separable
training data?

Real world data can be complicated: non-separable case
define a penalty for each misclassified training example (e.g., red dot)
Maximizing margin – (sum of penalties)
Subject to (constraints)
correctly classified training examples above H1 belong to the “+” class
correctly classified training examples below H2 belong to the dot class
define a constraint for each misclassified example

Real world data can be complicated: non-linear case


there is no straight line that can separate one class from the other or decision
boundary is not a line
Apply a transformation function to transform original training examples into the case
that can be separated by a line
Then apply Linear Support Vector Machine discussed in previous slides
Support Vector Machine
 has shown promising results in many practical applications
 works very well with high-dimensional data
SVM is often among the best-performing classification methods (my note); a minimal usage sketch follows below.
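For illustration only, a minimal scikit-learn sketch, assuming scikit-learn is available and using made-up toy data. The C parameter plays the role of the misclassification penalty for the non-separable case, and a non-linear kernel such as 'rbf' plays the role of the transformation function for the non-linear case.

```python
from sklearn.svm import SVC

X = [[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]]   # descriptive attributes (toy data)
y = [0, 0, 0, 1, 1, 1]                                  # class labels

# C penalizes misclassified training examples; kernel='rbf' handles non-linear boundaries
model = SVC(C=1.0, kernel="rbf")
model.fit(X, y)
print(model.predict([[5, 5]]))   # classify a new instance
```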

Topic 4: Bagging, Random Forest, and Boosting


 Can we create many different classifiers from one set of training data using one
classification method? YES
 Can we aggregate the predictions/classifications from different classifiers to
produce a better prediction/classification? YES

1. General procedure of an ensemble method.


D: training data; T: test data; k: number of classifiers
for i = 1 to k
Create a training sample Di from D
Learn a classifier Ci from Di using a classification method
end for
for each test record in T
aggregate its classifications produced by each Ci
end for
2. Bootstrap sampling
Given training data D with N records d1, d2, …,dN
Construct a training sample with N records:
Step 1: Draw one record from D with uniform distribution and put the record back to
D
Step 2: Draw one record from D with uniform distribution and put the record back to
D
….
Step N: Draw one record from D with uniform distribution and put the record back to
D
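A minimal sketch of the bootstrap-sampling step above (N draws with replacement, each record drawn with uniform probability):

```python
import random

def bootstrap_sample(records):
    """Draw len(records) records with replacement, uniform probability each draw."""
    n = len(records)
    return [records[random.randrange(n)] for _ in range(n)]

D = ["d1", "d2", "d3", "d4", "d5"]
print(bootstrap_sample(D))   # e.g. ['d3', 'd1', 'd3', 'd5', 'd2'] -- duplicates allowed
```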

3. General procedure of bagging


Bagging: bootstrap aggregating
a. How to aggregate in Bagging:
majority voting for classification
averaging for probability estimation
b. How to create training samples in Bagging: bootstrap sampling
 Bagging can improve prediction/classification performance for unstable classification
methods such as decision tree.
 A classification method is unstable if a small change in training data can result in
large changes in learned classifier.
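A minimal bagging sketch with decision trees and majority voting, assuming scikit-learn is available; the data set is made up.

```python
import random
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

# Toy training data
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 2], [3, 2], [2, 3], [3, 3]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

k = 5                                     # number of classifiers
classifiers = []
for _ in range(k):
    idx = [random.randrange(len(X)) for _ in range(len(X))]   # bootstrap sample
    Xi = [X[j] for j in idx]
    yi = [y[j] for j in idx]
    classifiers.append(DecisionTreeClassifier().fit(Xi, yi))

# Aggregate by majority voting (averaging would be used for probability estimation)
for record in [[0, 1], [3, 3]]:
    votes = [int(c.predict([record])[0]) for c in classifiers]
    print(record, Counter(votes).most_common(1)[0][0])
```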

4. General procedure of random forest


 Random forest is designed for decision trees only, while bagging can be used
with many classification methods.
 Instead of manipulating training records as bagging does, random forest focuses on
descriptive attributes.
A random forest fits a number of decision tree classifiers on various sub-samples
of the dataset and uses averaging to improve the predictive accuracy and control
over-fitting.
Lecture script: RF is not about creating samples; RF is about random selection of
attributes. Say we have 100 descriptive attributes. For building an ordinary decision tree we
would evaluate every single one of them to find the best entropy reduction and pick the
root. This is not the case for RF: randomly pick, say, k = 20 attributes, and the root will be one of
these attributes. (Records can be duplicated in a sample, but attributes cannot.) For the second tree,
select another random 20 attributes, so the second tree will be different from the first
one. RF only applies to decision trees, not to regression.
 General Procedure for Random Forest
D: training data; T: test data; k: number of classifiers
for i = 1 to k
Learn a decision tree Ci from D
at each node, (1) randomly select F descriptive attributes
(2) pick the best attribute to split from the selected
attributes
end for
for each test record in T
aggregate its classifications produced by each Ci
end for
 How to decide F
 If F is too large, learned decision trees are highly correlated
 If F is too small, the quality (e.g., prediction accuracy) of learned decision
trees is low
Rule of thumb: F = log2(# of descriptive attributes) + 1
 Benefits of random forest
 Improve prediction performance, in comparison to a single decision tree
 Reduce the computation time of learning
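A minimal sketch that applies the rule of thumb for F and passes it to scikit-learn's random forest; scikit-learn is assumed available, the toy data is made up, and note that sklearn's own default for max_features differs slightly from the lecture's rule.

```python
import math
from sklearn.ensemble import RandomForestClassifier

X = [[0, 0, 1, 0], [1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 1],
     [0, 0, 0, 0], [1, 1, 1, 1]]
y = [0, 1, 0, 1, 0, 1]

n_attributes = len(X[0])
F = int(math.log2(n_attributes)) + 1      # rule of thumb: attributes sampled at each node

forest = RandomForestClassifier(n_estimators=50, max_features=F, random_state=0)
forest.fit(X, y)
print(forest.predict([[1, 0, 1, 1]]))     # aggregate prediction of the k trees
```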

5. General procedure of boosting


Boosting starts like bagging: several different samples are drawn. Initially every record has the same
weight (probability of being sampled). We need to find out which records are difficult to classify and change the
weights accordingly. Build a decision tree on the sample, then apply it to all training records (the tree is
trained on the sample, not on the full training data). If the tree makes a mistake on, say, record 3
but correctly classifies the others, increase the weight for record 3 and decrease the weights of the
others; the weights should still sum to 1. You can also compute the error rate of the tree (e.g., 0.2 for
1 mistake out of 5). Take another sample, which will now contain more copies of record 3, and build another tree.
Apply it to the training data; now it may make mistakes on records 4 and 5 while all the others are
correct, so adjust the weights again. Repeat the process to build more trees. In
bagging every tree has the same voting power, while in boosting each tree's voting power depends on its
error rate: if the error rate is high, the voting power is low. (A sketch of one weight update follows below.)
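A sketch of a single boosting round's weight update in the spirit of AdaBoost; the exact formulas (alpha, exponential re-weighting) follow AdaBoost and may differ from the lecture's slides, but they show the behavior described above: misclassified records gain weight, the rest lose weight, and the weights still sum to 1.

```python
import math

weights = [0.2, 0.2, 0.2, 0.2, 0.2]                    # initially equal, sums to 1
misclassified = [False, False, True, False, False]      # suppose the tree errs on record 3

error = sum(w for w, m in zip(weights, misclassified) if m)   # 0.2: "1 mistake out of 5"
alpha = 0.5 * math.log((1 - error) / error)                   # voting power: low error -> high alpha

# Increase weights of misclassified records, decrease the others, then renormalize
new_weights = [w * math.exp(alpha if m else -alpha) for w, m in zip(weights, misclassified)]
total = sum(new_weights)
new_weights = [w / total for w in new_weights]

print(round(alpha, 3), [round(w, 3) for w in new_weights])
# e.g. 0.693 and [0.125, 0.125, 0.5, 0.125, 0.125] -- record 3 is sampled more next round
```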

 Bagging: manipulating training records
 Random forest: randomly selecting descriptive attributes
 Boosting: weight training records differently

Topic 5: Neural Network

1. What is perceptron? Slides 5 and 7, Lecture 12


Perceptron: models a single neuron to classify an instance
 Given descriptive attribute values of an instance, x1, x2, …, xn
 classify the instance as one class (class 1) or the other (class -1)

Summation: w0 + Σ(i=1..n) wi xi = Σ(i=0..n) wi xi if we set x0 = 1
Activation: o = 1 if Σ(i=0..n) wi xi > 0, and o = −1 otherwise
Summation – the formulation of a linear model; this is why the
perceptron is a linear model. Setting x0 = 1 makes the representation simpler.
Activation – if the summation is greater than 0, the output is set to 1; otherwise (≤ 0) it is −1.

2. The process of learning a perceptron. Slides 8, 9 and 10, Lecture 12


 Given training examples, how to learn values of w0, w1, w2, …, wn
o Consider one training example x1, x2, …, xn, t (class label attribute)
o Calculate f(x) = w0 x0 + w1 x1 + w2 x2 + … + wn xn
o Assign perceptron output o (the perceptron-predicted label) as 1 if f(x) > 0 or −1 if
f(x) ≤ 0
o Update wi for i = 0, 1, 2, …, n:
wi = wi + η(t − o) xi
(𝑡 − 𝑜) 𝑥𝑖 determines the direction of updating 𝑤𝑖 .
𝜂 is learning rate. It is usually set to small values (e.g., 0.1) and is made to decay
as the number of updates increases.
 Initially, randomly set values of w0 , w1 , w2 , … wn
 Go through all training examples many times to update values of w0 , w1 , w2 , …
wn until all training examples are correctly classified
Case t = 1, o = −1 (so t − o = 2), with η > 0:
if xi > 0, wi increases; if xi < 0, wi decreases.
The true label says f(x) should be > 0, but the perceptron computed f(x) ≤ 0,
so f(x) needs to be increased, which is exactly what this update does.
Case t = −1, o = +1 (so t − o = −2), with η > 0:
if xi > 0, wi decreases; if xi < 0, wi increases.
The true label says f(x) should be ≤ 0, but the perceptron computed f(x) > 0,
so f(x) needs to be decreased. (A small training sketch follows below.)
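A minimal sketch of the perceptron learning rule on a tiny made-up data set; for simplicity it runs a fixed number of passes rather than stopping exactly when all examples are classified correctly.

```python
import random

def f(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))      # w0*x0 + w1*x1 + ... with x0 = 1

def train_perceptron(examples, eta=0.1, epochs=100):
    n = len(examples[0][0])
    w = [random.uniform(-0.5, 0.5) for _ in range(n + 1)]   # random initial weights
    for _ in range(epochs):
        for x, t in examples:                  # t is the class label, +1 or -1
            x = [1.0] + list(x)                # prepend x0 = 1
            o = 1 if f(w, x) > 0 else -1       # perceptron output
            w = [wi + eta * (t - o) * xi       # update rule: wi = wi + eta*(t - o)*xi
                 for wi, xi in zip(w, x)]      # (no change when t == o)
    return w

# Toy linearly separable data: class +1 when x1 + x2 is large, -1 otherwise
data = [([0, 0], -1), ([0, 1], -1), ([1, 0], -1),
        ([2, 2], 1), ([3, 2], 1), ([2, 3], 1)]
w = train_perceptron(data)
print([round(wi, 2) for wi in w])
```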

3. How can Neural Network learns from a data set with non-linear decision boundaries?
Slide 14 of lecture 12

 Perceptron is a linear model, so it is ineffective for classifying data sets with non-
linear decision boundaries.
 Solution
 Change the activation function to a non-linear function
 Use a multilayer network

 Learning a Neural Network from training examples means learning the weights of all
edges in the Neural Network.
Process:
For each training example, calculate its outputs using the Neural Network;
Update weights based on the differences between the calculated outputs and the targets;
Repeat the process N (e.g., 500) times.
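A minimal multilayer-network sketch, assuming scikit-learn is available; it uses the XOR pattern, which no single perceptron (linear model) can classify. The hyperparameters are illustrative.

```python
from sklearn.neural_network import MLPClassifier

# XOR: not linearly separable, so a single perceptron cannot learn it
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

# One hidden layer with non-linear (tanh) activations gives non-linear boundaries
net = MLPClassifier(hidden_layer_sizes=(8,), activation="tanh",
                    solver="lbfgs", max_iter=1000, random_state=1)
net.fit(X, y)
print(net.predict(X))
```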
